Apparatus and method for structure-based prediction of amino acid sequences

ABSTRACT

The present invention provides methods and apparatus for analyzing a protein structure.

FIELD OF THE INVENTION

The present invention relates to the field of genomics as well as to thefield of protein engineering, in particular, an apparatus and method todetermine the amino acid sequence space accessible to a given proteinstructure and to use this information (i) to align an amino acidsequence with a protein structure in order to identify a potentialstructural relationship between said sequence and structure, which isknown in the art as fold recognition and (ii) to generate new amino acidsequences which are compatible with a protein structure, which is knownin the art as protein design.

BACKGROUND OF THE INVENTION

Within the field of bioinformatics there is an emerging need forso-called data mining tools. These tools aim at unraveling relationshipsthat connect more elementary data components such as oligo-nucleotideand amino acid sequences, single-nucleotide polymorphisms (hereinafterreferred as SNP), expression profiles, biochemical properties,structural and functional characterization, and so on. Theserelationships represent knowledge that can be used to build so-calledknowledge databases that offer a structured description (e.g. viaannotation mechanisms) of the more elementary data components such asabove defined. Knowledge databases will increasingly play a key role inthe biotechnology industry since they form an inference basis from whichthe behavior of biological systems can be understood, predicted andpossibly controlled via proper biotechnological engineering.

The need for data mining tools is further increased by the availabilityof rich, well annotated sequence databases which offer an in silicobasis for new gene hunting approaches. Here the determination ofsequence-to-structure relationships is playing a crucial role, also forthose who are essentially not interested in structural information.Indeed, via sequence-to-structure relationships, often distantsimilarities with other proteins can be found. Such similarities areunlikely to be found by direct sequence comparisons if the sequenceidentity drops below the 25% level, according to R. F. Doolittle in Urfsand Orfs. A primer on how to analyze derived amino acid sequences (1986)at University Science Books, Mill Valey (Calif.) and Sander et al. inProteins (1991) 9:56-68.

Conversely, structure-to-sequence relationships are also rapidly gainingattention for several reasons. Firstly, the available experimental basishas recently evolved towards a mature stage, since the number of knownprotein structures has now reached over 12,000. Depending on theclassification method, these structures represent about 600 to 700unique folds, i.e. about 60 to 70% of the estimated total number ofabout 1000 possible folds adopted by naturally-occurring proteins,according to C. Chothia in Nature (1992) 357:543-544. Secondly, theavailable technological basis has also made a giant leap since newmethods for side-chain positioning (according to M. Vásquez in Curr.Opin. Struct. Biol. (1996) 6:217-221), loop modeling (according toSudarsanam et al. in Protein Sci. (1995) 4:1412-1420), inverse folding(according to Bowie et al. in Science (1991) 253:164-170), proteindesign (according to Dahiyat et al. in Science (1997) 278:82-87) and soon continuously appear in the literature, which methods also benefitfrom computer devices becoming more and more powerful. Third, thescientific community has started recognizing the potential ofstructure-based protein engineering as can be inferred from the growingnumber of companies in this field and the success of initiatives likethe Critical Assessment of Techniques for Protein Structure Prediction(CASP) contests (http://PredictionCenter.11nl.gov/) and the structuralgenomics initiative (http://www.nigms.nih.gov/funding/psi.html).

Tools which are able to link a given amino acid sequence to the correctthree-dimensional (hereinafter referred as 3D) fold or compute the aminoacid sequences that are compatible with a given protein structure arevery valuable. Given the myriad of world-wide known sequences andstructures, distributed over different databases, some crucialparameters for determining the success of any such method are predictionaccuracy and computational speed.

The present invention relates to a novel approach to link one or moreamino acid sequences with one or more protein 3D structures and viceversa. More particularly, the invention concerns a new method preservingthe accuracy of the most accurate atom-based modeling techniquescurrently known while avoiding the bottleneck-problem known as the“combinatorial substitution problem”, thereby gaining several orders ofmagnitude in computational speed.

The fundamental problem to be addressed by the present invention relatesto multiple combined substitutions in a protein and can be illustratedby the following example. We shall consider a protein X for which the 3Dstructure has been experimentally determined. This structure is calledthe wild-type (WT) structure and the amino acid sequence of protein X iscalled the WT sequence. We shall further assume that X is an enzyme usedin an industrial process and that there is interest in constructing amodified protein X′ showing a considerably higher thermal stability thanthe wild type X, while preserving the latter's functionality. While suchproblems have often been addressed by introducing point mutations (i.e.the substitution of single amino acid residues at selected positions),it has been shown that the simultaneous mutation of multiple amino acidresidues is usually more efficient, according to InternationalApplication Publication No. WO 98/47089. On the other hand the latterapproach, known as protein design, is extremely complicated in view ofthe extremely high number of a priori possible combinations of singleresidue mutations. For example if the aforementioned task of thermallystabilizing protein X was limited to substitutions in the core of aprotein (i.e. the internal part of a protein where strictly buriedresidues are located) containing e.g. 15 residues, and if only 10(preferably apolar) residue types were considered, there would still be10¹⁵ a priori possible solutions. Each such solution represents an aminoacid sequence which is different from that of protein X and is, inprinciple, a candidate sequence for protein X′. In order to predictwhich sequences are more stable than the WT sequence, one could apply afast molecular modeling program, predict the 3D-structure of eachsequence, compute the potential or free energy of each modeled structureand store the best results. However, clearly no modeling program is atpresent fast enough to complete this task in an reasonable period oftime.

The former example illustrates the main bottleneck problem in proteindesign, which is known in the art as the “combinatorial substitutionproblem” and which arises from the fact that different individualsubstitutions in a protein are, in principle, not independent as shownin FIG. 1 a. The energy of a protein comprising a double substitutionmay not be simply calculated from the structures of two single mutants,because both mutated amino acid residues are “unaware” from each otherin the single mutants, i.e. the single mutant structures lack anyinformation about the interaction between both substituted amino acidresidues in the double mutant. Consequently the simultaneousintroduction of two or more substitutions into the protein to be modeledmay lead to either favorable or unfavorable interactions which cannot bederived from the structure of single mutants. When performing suchmultiple substitutions, some information about the pair-wiseinteractions between the removed WT residues and the introduced mutantresidues should therefore be taken into account.

On the other hand, there are also cases where combinatorial effects donot play a significant role, i.e. where single substitutions areindependent from each other and where individual substitution energiesare additive as shown in FIG. 1 b. Moreover, there is also experimentalevidence, at least in some cases, that the energetic contributions ofdifferent individual mutations are additive according to A. Ferhst inEnzyme Structure and Mechanism (1985), ISBN 0-7167-1614-3. This is sofor example when two individual substitutions are separated in space ina manner such that the pair-wise interaction between both residues isnegligibly small for both the WT residues and the substituted residues.However this no-interaction condition is not sufficient to conclude thattwo substitutions may be considered independently. Indeed, although tworesidues do not directly interact with each other, they may still becoupled through conformational adaptation of residues in their vicinityaccording to FIG. 1 c. In general, this type of coupling depends on the“plasticity” of the protein (plasticity: the ability to cushionconformational changes). Non-direct, conformational coupling may forinstance occur when two simultaneously substituted residues leave anintermediate residue in a situation where no conformational solutionexists which leads to favorable energetic interactions with both mutatedresidues. Such an intermediate residue could be considered as“non-plastic” with respect to the concerned pair of substitutedresidues.

Going back to the previous example of the enzyme X needing to bestabilized, this means that the modeling task cannot be performed bysimply proposing a molecule X′ comprising a set of substitutions whichhave each been modeled independently in the WT structure. On the otherhand, a mere combinatorial enumeration of all possible global structuresis extremely inefficient from a computational perspective and can beconsidered as undue work with respect to the primary goal of predictingan enzyme X′ having an optimal thermal stability.

However, several authors tackled the side-chain positioning problem inproteins by means of a combinatorial approach or equivalent method. Forinstance, the Dead-End Elimination (DEE) method takes into account bothside-chain/template and side-chain/side-chain interactions and uses amathematically rigorous criterion to eliminate side-chain rotamers whichdo not belong to the Global Minimum Energy Conformation (GMEC). Sincethe DEE routines usually do not converge towards a unique structure, acombinatorial end stage routine is needed to determine the GMEC. It wasthus demonstrated, namely by Gordon et al. in Fold. Des. (1999)7:1089-1098, and Wernisch et al. In J. Mol. Biol. (2000) 301:713-736that the combinatorial problem for multiple substitutions can bealleviated, by means of DEE, by eliminating a number of potentialsubstitutions which cannot be part of the globally best solution, i.e.the most stable protein molecule. Since eliminations occur at the singleor pair residue level, they result in a huge reduction of the searchspace since all combinations therewith are eliminated as well.Nevertheless, the DEE method is a valuable method to reduce thecomplexity of a system. The main disadvantages of the DEE method are (1)non-convergence towards a unique structure, as above-mentioned, (2) theinability to predict “second best” solutions and (3) a relatively weakperformance for large systems (i.e.

having more than about 30 residues), due to a mathematically stringentprocess which complicates the use of approximations.

One problem to be addressed by the present invention is to develop amethod rapidly and accurately predicting the compatibility of one ormore amino acid sequences with one or more protein 3D-structures. Therequirement of accuracy strongly pleads for the use of a detailed,atom-based force field (energy function) to compute the local and otherinteractions of a set of amino acid residues placed onto a 3D-structure.Furthermore, the above example showed that a correct positioning, usingsaid interaction energies, is severely hampered by the combinatorialproblem. On the other hand, all current methods which fully account forcombinatorial effects are in conflict with the second requirement of ahigh computational speed. This constitutes the major dilemma in proteinmodeling, more specifically in fold recognition and protein design.

In order to determine sequences that are reachable by a given proteinstructure and/or to identify the best-matching protein 3D-structure,given a particular sequence, one may use (1) prediction algorithmsrelying on statistically derived substitution matrices, (2) predictionalgorithms based on statistical descriptors, and (3) computation ofexplicit conformational energies, corrected for solvation effects.

In the approach using descriptors, a set of reference proteins (oftenreferred to as the learning set) is used to derive properties that canstatistically be used to judge the propensity for any given amino acidtype to be present in some given structural environment. For example, inthe original derivation of the 3D-profiles of Bowie et al. in Science(1991) 253:164-9, the environment of each residue is featured by (a) aburied side chain area, (b) a local secondary structure and (c) thepolarity of the environment. Using these features, the authorsrecognized 18 environmental classes. As a result, any given proteinstructure can be transformed into a linear string composed of the classidentifiers for each position along the protein's sequence. From adatabase of proteins of known structure, a so-called 3D-to-1D scoringtable is built to assign a score for finding each of the 20naturally-occurring amino acid types associated with each of theenvironmental classes. Using this scoring table, the best sequencealignment can then be computed against the string of environment classesfor a number of proteins.

This procedure initiated the so-called threading approach in inversefolding where the compatibility of candidate sequences with a givenscaffold is examined. As can be anticipated from the discretizationprocedure, wherein the complexity of a 3D environment is mapped onto alimited number of classes, the success of the threading approach issignificantly limited according to M. J. Sippl in J. Comp.-Aided Mol.Design (1993) 7:473-501 and Chiu et al. in Protein Eng. (1998) 11:749-752.

Global sequence signatures involving many residues may be detected but,in view of its statistical nature, the threading method will not beapplicable to accurately assess the sequence-structure compatibilityimplying e.g. perturbation effects wherein a given WT sequence is variedat a number of positions. In addition, even if a sequence is correctlypredicted to be compatible with a given structure, the threading methoditself is incapable of deriving an accurate 3D-structure of the sequencelaid over a given protein backbone structure.

In order to achieve accurate predictions, it is desirable to take intoaccount the full complexity of the 3D environment of each amino acidresidue. In the third approach wherein predictions are based oncomputational methods assessing the energetic of a system of proteinresidues, it is possible to compute the conformation of lowest energyand even the sequence of lowest energy for a system consisting of aprotein backbone and a set of amino acid residues which are allowed tovary in conformation and/or amino acid type. Most often, suchcomputations are based on advanced search techniques combined with adetailed model for atomic interactions. In addition to the bonded andnon-bonded interaction terms as used in molecular mechanics force fields(e.g. the CHARMM force field of Brooks et al. in J. Comput. Chem.

(1983) 4:187-217), these energy computations may also assess thehydrophobic effect through a suitable solvation potential such as e.g.the pair-wise solvation potential disclosed by Street et al. in Fold.Des. (1998) 3:253-258, thereby correcting for the modeling in absence ofexplicit water molecules.

In order to determine whether a sequence is compatible with a givenprotein 3D-structure or scaffold, one could in principle exploredifferent alignments of the query sequence relative to the given proteinbackbone and search for alignments yielding low conformational energy,if any. However such a process suffers from several inconveniences.First, this process is very time consuming, especially if the givenquery sequence has to be scanned against a large database of proteinstructures, thereby searching for the best scaffold that is likely tomatch with the given query sequence. Indeed, for each protein a seriesof detailed computations need to be carried out at the atomic level:several alignments need to be explored and each time the conformation orsequence of lowest energy must be identified, which involves thesolution of the so-called combinatorial problem whereby the bestsolution must be found out of all possible combinations ofsingle-residue states. Consequently it is unlikely that suchcomputations can be carried out in a high-throughput mode. Second, thealignments must allow for gaps which are often located in the loopregions. However for evolutionary distant proteins the precise gaplocations and lengths are a priori unclear, which in turn necessitatesmultiple trial runs to match one sequence with one structure.

In addition, some important problems simply cannot be answered by thisthird approach, such as for instance the task of computing all possibleamino acid sequences that are compatible with a given protein scaffold(below a pre-set threshold, using a suitable scoring function), becauseit is at present impossible to generate all possible amino acidsequences and assess their compatibility within a reasonable computationtime.

As a summary, there is a stringent need for a fundamentally new methodin order to explore the sequence space that is compatible with a givenscaffold and to rapidly scan for structural compatibility a given querysequence against a database of know protein structures.

Glossary

Throughout the foregoing and following description of the invention, theabbreviations, acronyms and technical terms used shall have thefollowing meanings and definitions:

-   ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET,    PHE, PRO, SER, THR, TRP, TYR, VAL: standard 3-letter code for the    naturally occurring amino acids (or residues) named, respectively:    alanine, arginine, asparagine, aspartic acid, cysteine, glutamine,    glutamic acid, glycine, histidine, isoleucine, leucine, lysine,    methionine, phenylalanine, proline, serine, threonine, tryptophan,    tyrosine, valine.-   ASA: Accessible Surface Area-   DEE: Dead-End Elimination-   GMEC: Global Minimum Energy Conformation-   PDB: Protein Data Bank-   RMSD: Root-Mean-Square Deviation

Accessible Surface Area (ASA)

-   -   The surface area of (part of) a molecular structure, traced out        by the mid-point of a probe sphere of radius ˜1.4 Å rolling over        the surface.        Amino Acid, Amino Acid residue    -   Group of covalently bonded atoms for which the chemical formula        can be written as ⁺H₃N—CHR—COO⁻, where R is the side group of        atoms characterizing the amino acid type. The central carbon        atom in the main-chain portion of an amino acid is called the        alpha-carbon (C_(α)). The first carbon atom of a side chain,        covalently attached to the alpha-carbon, if any, is called the        beta-carbon (C_(β)). An amino acid residue is an amino acid        comprised in a chain composed of multiple residues and which can        be written as —HN—CHR—CO—.

Backbone

-   -   See: Main chain

Conformational Space

-   -   The multidimensional space formed by considering (the        combination of) all possible conformational degrees of freedom        of a protein, where one degree of freedom usually refers to the        torsional rotation around a single covalent bond. In a broader        sense, it also includes possible sequence variation by extending        the definition of a rotamer to a particular amino acid type in a        discrete conformation.

Dead-End Elimination (DEE)

-   -   A method to reduce the conformational space of a side-chain        modeling task by eliminating individual side-chain rotamers        which are incompatible with the system on the basis of one or        more rigorous mathematical criteria. By a variant of this        method, pairs of rotamers belonging to different residues can        also be eliminated.

Energy Function, Force Field

-   -   The function, depending on the atomic positions and chemical        bonds of a molecule, which allows to obtain a quantitative        estimate of the latter's potential or free energy.

Flexible Side Chain, Rotatable Side Chain, Modeled Side Chain

-   -   A side chain modeled during performance of the method.

Global Energy

-   -   The total energy of a global structure (e.g. the GMEC) which may        be calculated in accordance with the applied energy function.

Global Minimum

-   -   The globally optimal solution of a protein consisting of a        number of elements (e.g. residue positions) which are a priori        susceptible to variation (e.g. using the concept of rotamers).

Global Minimum Energy Conformation, GMEC

-   -   The global conformation of a protein which has the lowest        possible global energy according to the applied energy function,        rotamer library and template structure.

Global Structure, Global Conformation

-   -   The molecular structure of a uniquely defined protein, e.g.        represented as a set of cartesian co-ordinates for all atoms of        the protein.

Homology Modeling

-   -   The prediction of the molecular structure of a protein on the        basis of a template structure.

Main Chain, Backbone

-   -   The part of a protein consisting of one or more near-linear        chains of covalently bonded atoms, which can be written as        (—NH—CH(—)—CO—)_(n). Herein, the C_(β)-atom is considered as a        pseudo-backbone atom and is included in the template because its        atomic position is unambiguously defined by the positions of the        true backbone atoms.

Protein Data Bank (PDB)

-   -   The single international repository for the processing and        distribution of 3-D macromolecular structure data primarily        determined experimentally by X-ray crystallography and nuclear        magnetic resonance.

Protein Design

-   -   That part of molecular modelling which aims at creating protein        molecules with altered/improved properties like stability,        immunogenicity, solubility and so on.

Root-Mean-Square Deviation, RMSD

-   -   The average quadratic deviation between the atomic positions,        stored in two sets of co-ordinates, for a collection of atoms.

Rotamer, Rotameric State

-   -   An unambiguous discrete structural description of an amino acid        side chain, whether preferred or not. According to this        definition, which is extended over the usual definitions by        considering less preferred states, it is not required that        rotamers be interchangeable via rotation only.

Rotamer Library

-   -   A table or file comprising the necessary and sufficient data to        define a collection of rotamers for different amino acid types.

Rotamer/Template (Interaction) Energy

-   -   The sum of all atomic interactions, as calculated using the        applied energy function, both within a given side-chain rotamer        at a specific position in a protein structure and between the        rotamer and the template.

Sequence Space

-   -   The multidimensional space formed by considering the combination        of all (or a set of) possible amino acid substitutions at all        (or a set of) residue positions in a protein.

Side Chain, Side Group

-   -   Group of covalently bonded atoms, attached to the main chain        portion of an amino acid (residue).

Side-Chain/Side-Chain Pair (Wise) (Interaction) Energy

-   -   The sum of all atomic interactions, as calculated using the        applied energy function, between two side-chain rotamers at two        specific positions in a protein structure.

Steepest Descent Energy Minimization

-   -   Simple energy minimization method, belonging to the class of        gradient methods, only requiring the evaluation, performed in        accordance with a particular energy function, of the potential        energy and the forces (gradient) of a molecular structure.

Template

-   -   All atoms in a modelled protein structure, except those of the        rotatable side chains.

SUMMARY OF THE INVENTION

The present invention provides a totally new approach to handlesequence-to-structure and structure-to-sequence relationships:

-   (i) in a first aspect, the present invention comprises the    systematical analysis of known protein structures in terms of    individual contributions of single, pairs and occasionally also    multiplets of amino acid residues to the global energy of a protein    comprising any of these residues, and-   (ii) in a second aspect, the present invention comprises a method of    combining the single, pair and multiplet contributions in order to    estimate the global energy of a protein structure comprising a    larger number of substitutions.

A main feature of the present invention is that energetic values forsingle, pairs and multiplets of residues in different conformations canbe pre-computed once and stored on an appropriate data storage devicefor all, or a subset of all, known protein 3D-structures. These data canafterwards be retrieved from the storage device when needed for thecalculation of global energies. The same data elements can be used overand again in various experiments wherein the compatibility of differentsequences with the same 3D-structure is investigated. The presentinvention allows to (i) produce data in a fast and consistent way and(ii) use the produced data to calculate accurate global energies ofproteins, which have undergone a relatively large number of mutationscompared to the original structure.

A central idea of the present invention is to analyze known proteinstructures by computing a quantitative measure reflecting the energeticcompatibility of all naturally-occurring and synthetic amino acids ofinterest at each residue position of interest in the structure. In asimilar way, the energetic compatibility of selected residue pairs andmultiplets may be computed as well. Upon generating such compatibilitydata, it is possible to take into account conformational adaptationeffects for residues surrounding the considered substituted residue(s),as if the same substitutions were introduced in vitro by proteinengineering techniques.

The said energetic compatibility data may be generated systematicallyfor different protein structures and stored concisely using appropriatedata storage devices. Ideally, it should be sufficient to use onlypre-computed data for single residues and/or residue pairs in order toderive the compatibility of any sequence with a given protein structurewith sufficient accuracy. In such case, it would become possible toderive extended lists of structure-compatible sequences as well assequence-compatible structures arithmetically instead of by means ofsophisticated protein modeling.

The present invention thus first provides a method for analyzing aprotein structure, the method being executable in a computer under thecontrol of a program stored in the computer, and comprising thefollowing steps:

-   -   A. receiving a reference structure for a protein, whereby        -   the said reference structure forms a representation of a 3D            structure of the said protein, and        -   the protein consists of a plurality of residue positions,            each carrying a particular reference amino acid type in a            specific reference conformation, and        -   the protein residues are classified into a set of modeled            residue positions and a set of fixed residues, the latter            being included into a fixed template;    -   B. substituting into the reference structure of step (A) a        pattern, whereby        -   the said pattern consists of one or more of the modeled            residue positions defined in step (A), each carrying a            particular amino acid residue type in a fixed conformation,            and        -   the one or more amino acid residue types of the said pattern            are replacing the corresponding amino acid residue types            present in the reference structure;    -   C. optimizing the global conformation of the reference structure        of step (A) being substituted by the pattern of step (B),        whereby        -   a suitable protein structure optimization method based on a            function allowing to assess the quality of a global protein            structure, or any part thereof, is used in combination with            a suitable conformational search method, and        -   the said structure optimization method is applied to all            modeled residue positions defined in step (A) not being            located at any of the pattern residue positions defined in            step (B), and        -   the pattern and template residues are kept fixed;    -   D. assessing the energetic compatibility of the pattern defined        in step (B) within the context of the reference structure        defined in step (A) being structurally optimized in step (C)        with respect to the said pattern, by way of comparing the global        energy of the substituted and optimized protein structure with        the global energy of the non-substituted reference structure;        and    -   E. storing a value reflecting the said energetic compatibility        of the pattern together with information related to the        structure of the pattern in the form of an energetic        compatibility object.

By “fixed” in step (A), it is meant that the template structure may bevaried by at most about 1 Ångström in RMSD compared to the originalstructure.

For the purpose of the rotameric embodiment of the method, such asexplained hereinafter, the selected rotamer library may be eitherbackbone-dependent or backbone-independent. A backbone-dependent rotamerlibrary allows different rotamers depending on the position of theresidue in the backbone, e.g. certain leucine rotamers are allowed ifthe position is within an α helix whereas other leucine rotamers areallowed if the position is not in a α helix. A backbone-independentrotamer library uses all rotamers of an amino acid at every position,which is usually preferred when considering core residues, sinceflexibility in the core is important. However, backbone-independentlibraries are computationally more expensive and thus less preferred forsurface and boundary positions. The rotamer library may originate froman analysis of known protein structures or it may be generated bycomputational means, either on the basis of model peptides orspecifically for the amino acids considered at each of said residuepositions of said protein backbone.

For the purpose of step (C), the best possible global conformation ofthe protein system includes any conformation that can be appreciated bya person skilled in the art as being closely related to the optimum. Forthe purpose of step (E), storage of the data may be either explicit orimplicit, the latter being accomplished e.g. by advanced data orderingor by conventional compression methods.

Other aspects of the present invention include:

-   -   a data structure comprising at least an ECO obtainable by the        above-defined method, for instance in the form of a protein        sequence which is different from any known protein sequence.    -   a nucleic acid sequence encoding a protein sequence analyzed by        the method as above-defined.    -   an expression vector or a host cell comprising the nucleic acid        sequence as above-defined.    -   a pharmaceutical composition comprising a therapeutically        effective amount of a protein sequence generated by the method        (as above-defined) and a pharmaceutically acceptable carrier.    -   a method of treating a disease in a mammal, comprising        administering to said mammal a pharmaceutical composition such        as above-defined    -   a database comprising a set of ECO's in the form of a data        structure, e.g. stored on a computer readable medium;    -   a computing device adapted to carry out any of the methods of        the present invention;    -   a computer readable data carrier having stored thereon a        computer program capable of carrying out any of the methods of        the present invention.

The present invention, its characteristics and embodiments, will now bedescribed with reference to the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows three schematic representations of part of proteinstructures in their conformations of lowest energy.

FIG. 2 shows a schematic representation of the mechanism of patternintroduction (PI) and structure adaptation (SA) according to the presentinvention.

FIG. 3 illustrates the principle of additivity of adaptation (AA)according to the present invention.

FIG. 4 shows the result of the alignment procedure of humanalpha-lactalbumin against a single residue GECO profile for henlysozyme, according to the present invention.

FIG. 5 shows a correlation diagram between the energy obtained from theeffective modeling of 100 different triple mutations at residues 5, 30and 52 of the B1 domain of protein G and the energy calculated for thesame substitutions derived on the basis of GECO energies, according tothe present invention.

FIG. 6 shows a computer system suitable for use with the presentinvention.

DETAILED DESCRIPTION OF THE INVENTION

As previously indicated, the present invention consists of two mainaspects: first a new method including the ECO concept, secondly the useof such a method to assess the energy of global multiply substitutedprotein systems.

The first aspect of the present invention will be explained namely withreference to FIGS. 1 to 3 which are now described in detail.

In FIG. 1, the bold line symbolizes the backbone, whereas thin linesshow residue side chains in their energetically optimal conformation.Residue positions are numbered from 1 to 5. Different residue types arerepresented by lines of different length. Black lines are used for theWT structure and grey lines show substituted residues. Dotted lines showresidues before substitution or conformational adaptation. In FIG. 1 a,residues 1 and 5 have been substituted by larger residue types. Whileeach single substitution would be compatible with the remainder of thestructure, the double mutation leads to an incompatibility situation(symbolized by the dotted circle), thus both mutations cannot be modeledindependently. Conversely, FIG. 1 b shows a situation wherein twomutations are assumed to be independent: the single substitutions atresidues 1 and 4, as well as the double mutation do not change theirenvironment and both residues do not significantly interact with eachother. FIG. 1 c shows that two residues which do not directly interactmay still be coupled through conformational adaptation: the largerresidue type at residue position 2 necessitates (dotted circle at left)a conformational change (indicated by an arrow) of residue 3 which is inconflict with residue 4 (dotted circle at right).

In FIG. 2, the same notations are used as in FIG. 1. The left-mostdrawing represents the protein of interest in its reference structure(ref). The middle drawing shows the reference structure wherein a singleresidue pattern p is introduced at residue position 1. The dotted circleillustrates a sterical conflict with residue 3. After the SA step(right-most drawing), residues 3, 4 and 5 have rearranged from theirreference state (dotted lines) to a new, adapted conformation (fulllines) which overcomes the said sterical conflict. The energy of each ofthese structures is represented underneath each drawing, showing thatthe introduction of p into the reference structure first leads to ahigh-energy “intermediate” structure which can be relaxed towards astructure of lower energy. The difference between the energy of thereference structure, E_(ref), and that of the substituted and adaptedstructure, E(p), is the ECO energy, E_(ECO)(p), which can be consideredas a substitution energy.

In FIG. 3, the same notations are used as in FIG. 1. FIG. 3 a shows anexample wherein the AA principle is met, i.e. the same conformationaladaptations (symbolized by the arrows) are observed for the doublesubstitution (at right) as are occurring for both single substitutions(at left). FIG. 3 b shows an example wherein the AA principle is notmet.

Before describing the ECO concept and its possible applications, thefollowing prerequisites should be explained:

-   1. a definition of applicable protein structures (see STRUCTURE    DEFINITION);-   2. a definition of conformational freedom (see CONFORMATIONAL    FREEDOM);-   3. a description of applicable energy models (see ENERGY FUNCTION);-   4. a description of preferred modeling techniques which are able to    generate an ECO (see STRUCTURE OPTIMIZATION); and-   5. a definition of a reference structure, its relevance and some    preferred options (see REFERENCE STRUCTURE).

Next, the ECO concept is discussed in detail, both descriptively (seeDESCRIPTION OF THE ECO CONCEPT) and mathematically (see FORMALDESCRIPTION OF ECO).Then, a preferred approach to systematically deriveECO's is discussed (see SYSTEMATICAL PATTERN ANALYSIS (SPA) MODE). Next,some preferred operations on basic ECO's are elaborated (see GROUPEDECO'S). Finally, a preferred method to estimate the energy of multiplysubstituted protein structures from single and double ECO and/or GECOenergies is described (see ASSESSING GLOBAL ENERGIES FROM SINGLE ANDDOUBLE (G)ECO's).

Structure Definition

The present method to generate ECO's is applicable to protein moleculesfor which a 3D representation (or model or structure) is available. Morespecifically, the atomic coordinates of at least part of a protein mainchain, or backbone, must be accessible. Preferably a full main chainstructure and, more preferably, a full protein structure is available.In the event that the available protein structure is defective, someconventional modeling tools (e.g. the so-called spare parts approachdisclosed by Claessens et al. in Prot. Eng. (1989) 2:335-345 andimplemented e.g. by Deihaise et al. in J. Mol. Graph. (1984) 2:103-106,or other modeling techniques such as reviewed by Summers et al. inMethods in Enzymology (1991) 202:156-204 ?) may be applied to buildmissing atoms or loops. Depending on the force field used (see ENERGYMODEL), the coordinates of at least the heavy atoms (any atom other thanhydrogen), or heavy atoms together with polar hydrogen atoms, or allatoms, of at least a region around the amino acid residues of interestmust be known, where the said region should be sufficiently large toenable accurate energetic computations. Optionally ligands, such asco-factors, ions, water molecules and so on, or parts thereof may beincluded in the definition of the protein structure as well. A knownstructure for the side chains of a protein of interest is not anabsolute requirement since the method of the present invention includesa side-chain placement step. However, knowing starting coordinates forthe said side chains may be useful when the user should decide toenergy-minimize the starting structure or to keep sonic side chains in afixed conformation.

The origin of the 3D protein structure is relatively unimportant withinthe context of the present invention since the ECO concept remainsapplicable to any 3D-structure conforming to aforementioned criteria.Yet, the meaningfulness of generated ECO's (and, therefore, theusefulness of applications derived therefrom) is directly proportionalto the quality (accuracy) of the starting structure. Therefore, a numberof possible sources of atomic coordinates, in decreasing order ofpreference, are provided hereinafter: X-ray determined structures,NMR-determined structures, homology modeled structures, structuresresulting from de novo structure prediction.

In the preferred embodiment of the present invention, someatoms—generally referred to as “template atoms”—may be kept fixedthroughout the process of generating ECO's. Preferably, a templateconsists of all main-chain atoms, C_(β)-atoms and ligand atoms. A usefulbut more delicate option is to include particular side chains into thetemplate, preferably side chains lying far enough (as determined by askilled user) from the residue(s) involved in the ECO generation.

The starting 3D-structure of a protein of interest forms a first sourceof input data for the performance of the present method.

In the definition of a protein of interest, there are no limits to thelength of the chain (provided that at least two amino acids are linkedtogether) or to the number of chains. The definition of a protein asused herein also includes domains (e.g. functional, enzymatic or bindingdomains) as well as smaller fragments such as turns or loops. Byextension, the method may also be applied to molecules such as peptides,peptido-mimetic structures and peptoids.

With respect to the second aspect of the present invention, i.e. amethod to assess the energetic fitness of multiply substituted proteinson the basis of values for single residues and pairs, a startingstructure is of lesser importance. Indeed, the latter method preferablyuses pre-computed values from a database, rather than generating themwhen needed, although the latter possibility is not excluded by thepresent invention.

Conformational Freedom

The present method involves two different types of structure changes,i.e. amino acid substitutions and conformational changes, the latterbeing discussed in this section. The present method to generate ECO'scan be executed by allowing conformational changes either (i) in adiscrete rotameric space or (ii) in a continuous space. The rotamericapproach being preferred will be described first.

The concept of rotamers states that protein side chains (and evenresidue main chains) can adopt only a limited number of energeticallyfavorable conformational states, called rotamers, according to Ponder etal. in J. Mol. Biol. (1987) 193:775-791. The attractive aspect ofrotamers is that the theoretically continuous conformational space ofamino acid residues can be transformed into a discrete number ofrepresentative states on the basis of a rotamer library (for instancethe rotamer library described by De Maeyer et al. in Fold. & Des. (1997)2:53-66, comprising 859 elements for the 17 amino acids having rotatableside chains), thus considerably facilitating a thorough, systematicalanalysis of a protein structure. Furthermore, some energetic propertiesof individual rotamers within the context of a given protein structurecan be pre-calculated before any modeling action takes place. The samecan be done for pairwise interaction energies between rotamers ofdifferent residues. On the basis of individual (or “self”) and mutual(or “pair”) energies, a large number of rotamers and rotamercombinations can be identified as being incompatible with the scaffoldof the protein of interest. Another major advantage of the rotamerconcept is that search methods are known in the art to find theenergetically best possible global structure for the side chains of aprotein, given a fixed main chain, a rotamer library and an energyfunction. Examples of such methods known in the art are the variousembodiments of the DEE method such as disclosed by Desmet et al. inNature (1992) 356:539-542, Lasters et al. in Protein Eng. (1993)6:717-722, De Smet et al. in The Protein Folding problem and tertiarystructure prediction (1994) 307-337, Goldstein in Biophys. Jour. (1994)66:1335-1340 and De Maeyer et al. (cited supra). Irrespective of thesearch method, the use of rotamers allows to overcome local energeticbarriers by simply switching between different rotameric states.

Furthermore, it should be noted that ECO's refer to the structuralcompatibility of “patterns” that are grafted into a given proteinstructure. As used herein, a pattern consists, at the choice of theuser, of one or more residue positions, each carrying an amino acid typein a certain conformation. Pattern residues need not be covalentlylinked to each other, but they must be intact (at least the whole sidechain must be considered) and their 3D-structure must be unambiguous. Inthe preferred embodiment of the present invention, different ECO's maybe generated, where each ECO is an energetic compatibility objectdefined on one or more residue positions, each having assigned aspecific amino acid type and adopting a well-defined rotameric state.

The selection of rotamers occurs on the basis of a rotamer library, forinstance according to one of the three following possibilities: (i) theretrieval of side-chain rotamers from a backbone independent library,(ii) the retrieval of side-chain rotamers from a backbone dependentlibrary and (iii) the retrieval of combined main-chain/side-chainrotamers from a backbone dependent library. The two former possibilitiesnecessarily imply that the main-chain structure of a pattern residue istaken from the protein starting structure. The latter possibilityrequires an additional comparison between the local backboneconformation in the protein structure and all backbone conformations inthe library and the retrieval of a side-chain rotamer associated withthe closest matching backbone conformation and is therefore morecomplicated since it involves the insertion of the main-chain portion ofthe rotamer selected from the library with the backbone of the proteinstructure. This can be performed in many ways. For instance for isolatedpattern residues, it is possible to apply a least-squares fitting of thepeptide plains determined by the main-chain dihedral angles of theselected library rotamer onto those of the protein structure where therotamer is to be inserted. For consecutive pattern residues (e.g.loops), the method described in Brugel (Delhaise et al. in J. Mol.Graph. (1984) 2:103-106) can be used. In both cases, the method may befollowed by an energy refinement step using a gradient method, e.g. asteepest descent energy minimization. A common feature of the threepossible selection methods is that they result in a unique definition ofthe 3D structure of each pattern residue, their side-chain conformationsare selected from a rotamer library whereas their main-chainconformations are taken either from a library or from the proteinstarting structure.

Another aspect of the present invention concerns the conformationalfreedom of that part of the protein not belonging to the pattern ofinterest, i.e. the “pattern environment”. In this respect, the presentmethod includes a step wherein the pattern environment isconformationally adapted to the pattern itself and which essentiallyaims at an energetical optimization of intramolecular interactionswithin the pattern environment and between the latter and the patternatoms. Thus, we shall make a distinction between a rotameric and anon-rotameric embodiments, both subject of the present invention, sincethey have quite different characteristics. In the preferred rotamericembodiment, the accessible conformational freedom of the patternenvironment is determined by a rotamer library. Similarly to patternresidues, the rotamer library used for the non-pattern residues may beeither a side-chain or a combined main-chain/side-chain rotamer library.However, restricting the conformational space of the non-patternresidues to only side-chain rotamers (thus keeping fixed the main chain)is highly preferred because of its mathematical simplicity.

In contrast to the assignment of one unique rotameric conformation toeach of the pattern residues, the residues of the pattern environmenthave in principle access to all rotameric states known in the libraryfor the associated residue types. The total conformational space is thendefined as all possible combinations of rotamers for the non-patternresidues. The search method described in the section STRUCTUREOPTIMIZATION below then needs to find the energetically most favorablecombination.

As will be appreciated by those skilled in the art, several techniquescan be applied to restrict the number of rotamers for individualresidues in a protein structure, based on criteria related to localsecondary structure, local energy, solvent accessibility and so on. Inaddition, some residues of the pattern environment may be completelyrefrained from adaptation (this is equivalent to include such residuesinto the template, see STRUCTURE DEFINITION) in order to speed upcomputations. This however may lead to underestimate the structuralcompatibility of the pattern of interest. The user therefore has todecide whether the induced level of error remains acceptable in theapplication concerned. For example in a fold recognition application,where the compatibility of a long amino acid sequence with differentprotein folds is to be assessed, many single residues may be erroneouslypredicted as incompatible if the global score for a correct or “true”sequence-structure correlation remains distinguishable from theincorrect or “false” matches. It may even be preferred to generate ECO'swhile limiting adaptation to only those residues in immediate contactwith the pattern residue(s). To the contrary, should a user beinterested to test whether a given pattern (e.g. the binding site of areceptor, or any part thereof) can be designed into a scaffold ofinterest, then it is advisable to eliminate as many sources of error aspossible and therefore to retain a high number of adaptive residuesaround the said pattern. As a general rule, it is suggested to apply oneof many possible threshold-based methods directly or indirectly relatedto the distance between pattern and environment residues. For example aconvenient criterion may be to include in the set of adaptive residuesall residues j for which

min_(i) D(i,j)<S+n(i ^(a))+n(j ^(b))   (eq.1)

where i is a pattern residue, j is an environment residue, D(i,j) ispreferably equal to the distance between the C_(β)-atoms of i and j inthe protein structure, min_(i) D(i,j) is the closest such distance forall pattern residues i, S is a predefined threshold value (preferablyabout 6 Å but possibly lower or higher values for less or more criticalapplications, respectively), i^(a) and j^(b) are the residue types atpositions i and j, respectively, and n(i^(a)) and n(j^(b)) are measuresof the size of the side-chains at the respective positions i and j,where n is preferably equal to the number of heavy atoms (i.e allnon-hydrogen atoms) counted from the C_(β)-atom (included) along thelongest branch in the side chain. For example, if i^(a)=Ile, j^(b)=Pheand S=6Å, j would be included in the set of adaptive residues ifD(i,j)<6+3+5=14 Å.

Although less preferred, the present method to generate ECO's can alsobe performed without using the rotamer concept. Again a distinctionneeds to be made between the pattern residues and the patternenvironment. With respect to the conformation of the one or more patternresidues, a first approach is to “copy” a sub-structure from a known3D-structure and to graft it into the structure of the protein ofinterest using a suitable structure fitting method. This can be ofinterest, for example, in loop grafting or when attempting to substitutebinding site patterns. This fully matches the ECO concept since thedefinition of a pattern requires establishing a well-defined set ofresidue positions (e.g. a loop fragment), carrying a well defined aminoacid sequence (one residue type at each position) in a well-definedconformation (e.g. a known conformation). An alternative to assign aconformation to pattern residues is to randomly sample it from acontinuous torsion space around main-chain and/or side-chain dihedralangles using a random number generator. This approach does form afeasible alternative to investigate the compatibility of differentpatterns occupying the same residue positions and types while adoptingdifferent non-rotameric conformations.

The non-pattern residues form the subject of the second step of thepresent method, i.e. the adaptation step. Here also, a less preferredbut feasible non-rotameric approach is possible. The structuraladaptation of the pattern environment in a continuous conformationalspace will be driven by the selected optimization method, typically agradient method, in combination with the selected energy function,typically a potential or free energy function. Here, the accessiblespace is in principle nearly infinite since it includes all possiblecombinations of continuous torsion variations with bond angle bending,bond stretching, and so on. However, in practice, most continuousoptimization methods (such as e.g. steepest descent energy minimization)do cause only local changes in the structure and are prone to gettrapped into local optima instead of the global optimum. On the otherhand, gradual non-rotameric optimization methods advantageously providea simple solution to the problem of main-chain flexibility becauseduring structural optimization of a pattern environment, the proteinmain-chain can be optimized together with the side chains.

In conclusion, both a rotameric and non-rotameric method are suitablefor both pattern definition and adaptation of the pattern environment.Although all combinations are possible and useful in particularcircumstances, particularly preferred are the combination of rotamericpatterns with non-rotameric adaptation as well as combinations ofrotameric adaptation with rotameric or non-rotameric patterns ofinterest.

Energy Function

The most important element of an ECO is its energy value, which shouldbe interpretable as a quantitative measure of the structuralcompatibility of the ECO pattern within the context of the adaptedprotein structure. For this reason, the notion of “structuralcompatibility” may be considered equivalent to “energeticcompatibility”.

In principle, the method of the present invention may be performed withall energy functions commonly used in the field of protein structureprediction, provided that the applied energy model allows to assign anenergetic value that is a pure function of the atomic coordinates, atomtypes and chemical bonds in a protein structure, or any part thereof.When considering rotamers it is preferred, although not required, thatthe energy function be pair-wise decomposable, i.e. the total energy ofa protein structure can be written as a sum of single and pair-residuecontributions (see equation 4 below), in order to facilitate theadaptation step in the generation of ECO's.

Preferably, the applied energy model should approximate as close aspossible the true thermodynamic free energy of a molecular structure.Therefore, the applied energy function should include as many relevantenergetic contributions as possible, depending on the accuracyrequirements of the user. Highly preferred energy functions includeterms which account for at least Van der Waals interactions (including aLennard-Jones potential such as disclosed by Mayo et al. in J. Prot.Chem. (1990)), hydrogen bond formation (such as referred by Stickle etal. in J. Mol. Biol. (1992) 226:1143), electrostatic interactions andcontributions related to chemical bonds (bond stretching, angle bending,torsions, planarity deviations), i.e. the conventional potential energyterms. Useful energy models including such contributions are, forexample, the CHARMM force field of Brooks et al. (cited supra) or theAMBER force field disclosed by Weiner et al. in J. Am. Chem. Soc. (1984)106:765.

In addition to these potential energy terms, some other energeticcontributions may be important, given the fact that different ECO's mayinvolve different residue types of varying size and chemical nature,placed at topologically different positions in a protein structure. Fourtypes of energetic contributions and corrections may account for type-and topology-dependent effects: corrections for (1) main-chainflexibility, (2) solvation effects, (3) intra side-chain contributionsand (4) the behavior of particular residue types in the unfolded state.

Main-chain flexibility, or adaptability of the backbone to an insertedpattern, can be corrected by reducing the sensitivity of the termsrelated to packing effects, e.g. using methods including united-atomparameters such as disclosed by Lazar et al. In Protein Sci. (1997)6:1167-1178 or re-scaling of atomic radii such as proposed by Dahiyat etal. Proc. Natl. Acad. Sci. USA (1997) 94: 10172-10177. When generatingECO's using a gradient method in the adaptation step, main-chainvariability is automatically taken into account by including thecorresponding atoms in the optimization process.

Because of the hydrophobic effect, solvation effects may be a crucialfactor determining the reliability of modeled protein structures.According to the present invention, computing explicit protein-solventinteractions is not preferred since this is prohibitively slow. Instead,a compromise between computational performance and prediction accuracyis a method wherein the burial and exposure of polar and non-polar atomsis translated into energetic correction terms for solvation on the basisof their buried and exposed surface areas and of a set of atomtype-specific solvation parameters, such as reviewed by Gordon et al. inCurr. Opin. Struct. Biol. (1999) 9:509-513. In the embodiment of thepresent invention wherein the adaptation step in the ECO generationoccurs in a rotameric way, it is beneficial to have a pair-wisedecomposable solvent function, such as proposed by Street et al. (citedsupra). Also preferred for its computational efficiency is a method forthe assignment of a set of energetic solvation terms to each of theconsidered residue types (usually the 20 naturally-occurring aminoacids) depending on the degree of solvent exposure of their respectiverotamers at the considered residue positions in the protein structure ofinterest. This method is hereafter referred to as the type-dependent,topology-specific solvation (TTS) method. In this method, a set ofN_TYPES energetic parameters is established for each of N_CLASSESdifferent classes of solvent exposure, where N_TYPES is the number ofresidue types considered (usually 20) and N_CLASSES is the number ofdiscrete classes of solvent exposure. For example, three such classesmay be established, corresponding with buried, semi-buried andsolvent-exposed rotamers, respectively. It is noteworthy that differentrotamers for the same residue type at the same position may be assigneddifferent classes. Both the number of classes and their boundaries maybe subject to variation, but the following embodiment should bepreferred:

-   -   first, each pattern element (a given residue type at a given        residue position in a specific rotameric state) is substituted        into the protein structure and its accessible surface area (ASA)        is calculated, for instance according to Chothia in        Nature (1974) 248:338-339,    -   next, a class assignment occurs on the basis of the percentage        ASA of the residue side chain in the protein structure compared        to the maximal ASA (over all rotamers) of the same side chain        being shielded from the solvent only by its own main-chain        atoms, for instance the class of buried rotamers may be defined        by ASA %<5%, the class of semi-buried rotamers by 5%<ASA %<25%        and the class of solvent-exposed rotamers by 25%≧ASA %,    -   and finally an appropriate type- and topology-specific energy,        E_(TTS)(T,C), for the considered pattern element is retrieved        from a table of TTS values, using the pattern type (T) and        class (C) indices.

The TTS energy thus obtained should be added as an extra term to theenergy of the pattern element in the protein structure (see FORMALDESCRIPTION OF ECO'S).

-   -   A table of TTS values may be established by tuning the different        TTS parameters E_(TTS)(T,C) versus a heuristic scoring function        S as follows. The tuning method includes:    -   first considering a set of a number of, preferably at least 50,        well-resolved protein structures and assigning to each residue        position of this set a unique index i (1≦i≦all residues),    -   then considering, at each residue position i, all residue types        a in all rotameric states r, thereby forming the set of all        possible position-type-rotamer combinations i_(r) ^(α),    -   defining by i_(r) ^(w) the WT residue type at position i in a        rotameric state r as observed in the protein comprising position        i,    -   assigning to each it a solvent exposure class C(i_(r) ^(a)) as        described above,    -   calculating for each i_(r) ^(a) the value E_(pot)(i_(r) ^(a)) as        the total potential energy the rotamer i_(r) ^(a) experiences        within the fixed context of the protein structure comprising i,        preferably using conventional potential energy terms such as        disclosed above,    -   defining a function ΔE(i) as

ΔE(i)=min_(r)(E _(pot)(i _(r) ^(w))+E _(TTS)(w,C(i _(r)^(w))))−min_(a)min_(r)(E _(pot)(i _(r) ^(a))+E _(TTS)(a,C(i _(r) ^(a))))  (eq. 2)

-   -   ΔE(i) being interpretable as the difference in energy of the WT        residue type in its best possible rotamer (as obtained by the        min, operator) and the energetically most favorable residue type        at residue position i also in its best possible rotamer (as        obtained by both the min_(a) operator and the min, operator),        given a(as yet undefined) set of TTS parameters, and    -   applying a suitable parameter optimization method to maximize S        by varying E_(TTS)(T,C) parameters.

If the TTS parameters are zero, then equation (2) provides thedifference in potential energy between the optimal WT rotamer and theenergetically most favorable residue type at position i in its optimalrotameric state, within the context of the protein structure involved.The effect of assigning non-zero values to the TTS parameters is thatthe energy of a residue will be modified by its associated type- andtopology-dependent correction term. In the ideal case, theTTS-parameters may be such that, at each position i, the observed WTresidue becomes or approximates the energetically optimal residue type.In order to reach this objective, the objective scoring function Sdefined by the following equation (3)

S=Σ _(i)(1+ΔE(i))⁻¹   (eq.3)

may be used, which means that if at position i the WT residue type isbest (i.e. ΔE(i)=0), then S is increased by one unit. For instance, ifthe WT would be 1 kcal. mol⁻¹ higher in energy than the best possibletype, this would result in an increase of S by 0.5 unit.

Finally, It has been observed that the function S as defined by equation(3) on 60 TTS parameters (20 types×3 classes) is easy to optimize bysimply taking steps above and below the current value of each parameterand accepting only up-hill changes (i.e. better values) in S. Initialparameters may be set to zero and initial steps are preferably about 10kcal mol⁻¹, gradually reducing in size by about 75% per optimizationround until a step size of about 0.05 kcal mol⁻¹ is reached, whichterminates the parameter optimization process.

The same strategy of parameter tuning may lead to even more relevantresults if the potential energy calculations, E_(pot)(i_(r) ^(a)), areperformed within the context of an adaptive protein structure ratherthan a fixed structure. This can be accomplished e.g. by using one ofthe structure optimization methods described in the next section.Moreover, such approach is fully consistent with the ECO concept itself,wherein the compatibility of patterns is evaluated in an adaptiveenvironment. A set of TTS contributions derived in this way is shown intable 3.

A most attractive feature of the TTS method is that terms used thereinwill automatically account for two other effects which may be criticalin the field of protein design. The first effect is the mutual bias ofdifferent residue types when comparing their intraside-chain energies.For example, an arginine residue may have an energetic advantage over achemically similar lysine residue by more than 5 kcal mol⁻¹ (using the

CHARMM force field) which is solely due to intraside-chaincontributions. In order to correct for this phenomenon in the prior art,such contributions may be simply switched off or corrected for bysubtracting a reference energy calculated as the potential energy of anisolated amino acid residue, according to Wodak (cited supra). However,the present invention prefers not to impose any a priori intraside-chaincorrection, but to incorporate this into the TTS contributions. Thesecond effect relates to energetic aspects of the unfolded state, whichis ignored when computing energies exclusively from the folded state ofthe protein. This involves complicated elements like changes in entropy,residual secondary and tertiary structure in the unfolded state, theexistence of transient local and global states, and so on. Consequently,in a modeling environment where native, folded structures are examined,it is extremely difficult to predict the impact of such effects on thepredictions. However, in as far as such effects would be type- andtopology-dependent, it may be assumed that the TTS method implicitlytakes them into account. For example, buried leucine residues mayexperience a greater loss in torsional entropy compared to isoleucineupon folding of the protein in view of the fact that the latter isbeta-branched.

In conclusion, the present method for generating ECO's requires asuitable energy function including as many relevant contributions aspossible, preferably functions based on inter-atomic distances and bondproperties. There is no specific preference with respect to explicittreatment of hydrogen atoms or not. The implementation of theaforementioned TTS method will be useful when ECO's are prepared forfold recognition or protein design, as well as for other applications.

Structure Optimization

When generating ECO's, a practical and reliable method for structureoptimization is needed, i.e. a search method producing an optimal ornearly optimal 3D structure for a protein molecule, given the startingstructure of the protein as previously described in STRUCTUREDEFINITION, the introduction of one or more substitutions according tothe pattern of interest (except for the reference structure described inthe section REFERENCE STRUCTURE below), the allowed conformationalvariation of the protein system as previously described inCONFORMATIONAL FREEDOM and the applied energetic model as described inENERGY FUNCTION.

ECO's are intended to measure the change in total energy when non-nativepatterns are introduced in the context of a protein structure, i.e. anECO energy is actually a differential value between the total energy ofa substituted protein structure and that of a native or “reference”protein structure. Both the substituted and the reference structuresshould be modeled in a similar and reliable way in order to allow forinterpretation of an ECO energy as a measure of the true substitutionenergy associated with the pattern of interest. Therefore the methodshould guarantee that both the substituted and the non-substitutedprotein structures approximate as close as possible the structures asmay be determined experimentally.

Assuming a nearly constant backbone conformation may be a source oferror for the most constrained substitutions: proteins comprisingcomplicated substitutions may reorganize their backbone conformation ormay even completely unfold. In such cases, ECO substitution energiespredicted according to the present invention may be in error, but shouldnot change the conclusion that such substitutions are probablyincompatible with the protein backbone.

-   -   The present invention has no preference with respect to the        structure optimization (or search) method used as long as it        provides a beneficial ratio of prediction accuracy versus        computational speed. Suitable known search methods include, for        example, Monte Carlo simulation such as disclosed by Holm et al.        in Proteins (1992) 14:213-223, genetic algorithms such as        disclosed by Tuffery et al. in J. Comput. Chem. (1993)        14:790-798, mean-field approaches such as disclosed by Koehi et        al. in J. Mol. Biol. (1994) 239:249-275, restricted        combinatorial analysis such as disclosed by Dunbrack et al in J.        Mol. Biol. (1993) 230:543-574, basic molecular mechanics        gradient methods such as described in Stoer et al. in        Introduction to Numerical Analysis (1980), Springer Verlag, New        York and DEE methods such as previously disclosed. In the        preferred embodiment of the present invention wherein the        conformational freedom of individual residues is represented by        rotamers, a yet unknown structure optimization method is however        preferred over the previously known methods because it combines        a high accuracy together with a high intrinsic computational        efficiency. This combined characteristics is important        especially in the systematical pattern analysis (SPA) embodiment        of this invention wherein typically hundreds of thousands of        patterns per protein need to be analyzed. This method,        hereinafter referred as the FASTER method, is a method for        generating information related to the molecular structure of a        biomolecule (e.g. a protein), the method being executable by a        computer under the control of a program stored in the computer,        and comprising the steps of:    -   (a) receiving a 3D representation of the molecular structure of        said biomolecule, the said representation comprising a first set        of residue portions and a template;    -   (b) modifying the representation of step (a) by at least one        optimization cycle; characterized in that each optimization        cycle comprises the steps of:        -   (b1) perturbing a first representation of the molecular            structure by modifying the structure of one or more of the            first set of residue portions;        -   (b2) relaxing the perturbed representation by optimizing the            structure of one or more of the non-perturbed residue            portions of the first set with respect to the one or more            perturbed residue portions;        -   (b3) evaluating the perturbed and relaxed representation of            the molecular structure by using an energetic cost function            and replacing the first representation by the perturbed and            relaxed representation if the latter's global energy is more            optimal than that of the first representation; and    -   (c) terminating the optimization process according to step (b)        when a predetermined termination criterion is reached; and    -   (d) outputting to a storage medium or to a consecutive method a        data structure comprising information extracted from step (b).

In short, this method works by repeated, gradual optimization ofrotameric sets of residues. When the protein system is near its globallyoptimal structure, which is the case if small patterns (e.g. one or tworesidues) are substituted into a reference structure, only a smallnumber of optimization rounds are needed to obtain the optimally relaxedstructure. This process typically takes in the order of about 10 to 100milliseconds CPU time per pattern analysis (using a 300 MHz R12000processor as a reference). The computation of the reference structureitself typically requires about 100 to 1000 CPU-seconds per proteinstructure.

Reference Structure

The reference structure is defined as the structure of lowest energy forthe protein of interest, given an experimentally determined orcomputationally generated starting structure, a particular rotamerlibrary (only in the rotameric embodiment of the present invention) anda selected energy function and may be obtained by a suitable searchmethod (see STRUCTURE OPTIMIZATION).

The role of the reference structure is to make an ECO energy (the valuecomputed in step F) interpretable as a substitution energy. ECO's aregenerated by substituting a non-native pattern into an optimally relaxedreference structure, reconsidering (in the structure adaptation step)the optimal packing arrangement and computing the difference in globalenergy between the modified and the original reference structure.

An important feature of the reference structure is its amino acidsequence. For instance, the protein of interest may be modified prior tocomputing its structure of lowest energy by supplying the backbone withnew side chains. Thus, the user of the method may define a referencesequence which is different from the sequence observed in the startingstructure of the protein. Either a few amino acid side chains or thecomplete amino acid sequence may be altered. The reference structure isthen the conformation of lowest energy for the protein of interesthaving a user-selected reference amino acid sequence. The most readyreference sequence is the native amino acid sequence of the proteinstarting structure, but valuable alternatives thereto will be explainedhereafter.

The reference sequence should preferably approximate as close aspossible the sequences that will be generated in an ECO-basedapplication. For example, when interest is in thermally stabilizingmutants of a given enzyme of known sequence but unknown structure, andassuming that only a distantly related homologous 3D-structure would beavailable from a database of protein structures, then it is advisable tocompute the structure of lowest energy for the sequence of said enzyme(e.g. using the homologous backbone structure) rather than for thesequence of the homologous protein (using its own backbone structure).If the mutants of the said enzyme were built from ECO's generated forthe homologous structure, this would involve many more individualchanges (and potential sources of error) as compared to when the enzymemutants are built with ECO's generated with the modeled enzyme sequenceas a reference structure.

However protein design and fold recognition applications may also relyon some general purpose databases of ECO's derived from representativeprotein structures or scaffolds. Furthermore, it is not an absoluterequirement to use any known or existing reference amino acid sequencewhen generating ECO's. A most interesting type of reference sequence maybe the sequence leading to the lowest possible energy for the protein ofinterest, in other words the global minimum energy sequence which may becomputed by any suitable search method such as previously described (seeSTRUCTURE OPTIMIZATION), e.g. a DEE method or the FASTER method. Someother reference sequences, such as a polyglycine or a polyalaninesequence, may provide practically useful reference structures. Theyprovide the advantage that the adaptation step in the ECO computation iseasier, but the disadvantage that the ECO patterns are modeled onto analmost naked backbone. Finally, we observed experimentally thatassessing multiply substituted proteins from single and double ECO's issurprisingly insensitive to the nature of the reference sequence,especially in applications which tolerate some level of approximation(like fold recognition). Therefore any possible reference sequence,including randomly generated and dummy sequences, may be practicallyuseful in particular cases. Selecting a reference sequence forms anessential part of generating an ECO since it directly relates to theinterpretation of the associated energy value as being a substitutionenergy.

Description of the ECO Concept

An “energetic compatibility object” (ECO) is an object which basicallyprovides a fully structured description of the results of a modelingexperiment wherein a given pattern p of interest has been introduced andmodeled into a protein reference structure. Usually the pattern pconsists of a small number of amino acid residue types, each beingplaced at a specific residue position in the reference structure andadopting a well-defined fixed conformation. In a preferred embodiment ofthe present invention, hereinafter referred as the “systematical patternanalysis” (SPA) mode, a large number of predefined patterns may begenerated and analyzed within the context of the reference structure.The SPA mode is particularly suited to obtain large-scale resultsconcerning the mutability of single and pairs of residues. It alsoprovides the necessary input data for the analysis of multiplysubstituted proteins on the basis of elementary data components.

The number of residues of p is usually 1 or 2 but larger patterns ofinterest, comprising e.g. a set of clustered core residues, may beconsidered as well. In general, the present invention does not imposerestrictions to the number of residues included in p since by definitionp is a pattern of interest to the user. However, in the preferred SPAmode, where ECO's are systematically generated for many differentpatterns, the number of residues of any such pattern may preferably belimited to 1 or 2.

Besides the number of residues, an unambiguous description of p alsorequires establishing the exact locations of the one or more residues ofp within the framework of the reference structure. In the presentinvention the allowed locations correspond with amino acid residuepositions, which means that the said locations are defined by selecting,for each pattern residue, a specific amino acid residue position ofinterest. All conventional amino acid residues which are present in theprotein reference structure are valid positions, irrespective of whetherthey are chemically modified or cross-linked to other residues.Conversely, atoms or molecules not belonging to a protein or peptide,known in the art as ligands, are not considered as valid positions. Inthe case of patterns consisting of more than one residue, some furtherrules apply as follows. Any two residues of the same pattern p shouldnot occupy the same position. Also, as will be appreciated by those inthe art, it may be desirable to consider only patterns for which theresidue positions are proximate in space since very distant residues arelikely to be independent. However this is preferred only when (i) alarge number of different patterns has to be examined and computationalspeed is a limiting factor, and (ii) the condition of independence andthe related criterion of proximity have been statistically verified. Inthe preferred SPA mode of implementation of the invention, the residuepositions of interest may be determined by the region in the proteinreference structure which is of interest to the user. If the interest ise.g. in stabilizing a protein, only core residues may be taken intoaccount and a set of patterns may be systematically defined for allsingle core residues as well as all possible pair-wise combinationsthereof.

A further aspect of the present invention is the chemical nature of theone or more residues of p which may be validly considered at theselected positions in the reference structure. According to theinvention, all naturally-occurring and synthetic amino acid types may beincluded in the set of valid residue types. The choice of these residuetypes is basically unrelated to the amino acid sequence of the referencestructure. If a residue type of p differs from the residue type in thereference structure at the selected position, this actually correspondsto a mutation. In such case, the original residue type needs to beremoved from the reference structure and replaced by the pattern residuetype. Each replacement step is preferably done by respecting the rulesof standard geometry (standard bond lengths and angles and defaulttorsion angles). In the preferred SPA mode of implementation of theinvention, the set of single residue patterns may consist of all validresidue types as defined above, placed at each residue position of theregion of interest. The set of pair residue patterns may consist of allpossible pair-wise combinations thereof.

Fully consistent with the present invention is any method used in step Bin order to further restrict the number of valid residue types at eachposition in the region of interest. Such an embodiment may be used (1)when the available computational resources necessitate a restriction, or(2) in cases which are interrelated with the application of ECO's.Ignoring the first case (thus assuming infinite computationalresources), it will be clear to those skilled in the art that someresidue types may be discarded from consideration depending on theposition in the reference structure. Examples of such restrictionmethods include, without limitation:

-   -   the exclusion of synthetic residue types,    -   the exclusion of proline (actually an imino acid) residues        depending on the local main-chain conformation,    -   the restriction of core residues to non-polar types,    -   the exclusion of residue types with low helical propensity in        alpha-helices,    -   the preservation of the chemical nature (e.g. aromatic) of        residue types in the reference structure, and so on.

So far, a pattern p is defined as one or more residue types placed intothe reference structure at well defined residue positions. In order tounambiguously define p, the conformation of each p-residue must befurther specified by the assignment of one specific conformation to eachresidue of p, preferably by selecting a specific rotamer from a rotamerlibrary as described hereinabove. Alternatively, in the non-rotamericembodiment of the present invention, the said conformational assignmentmay result from other sources including, but not limited to,experimentally determined protein structures, computer-generatedconformations or, as described below, a grouping process applied todifferent ECO's. The present invention does not impose restrictions tothe nature and origin of the conformation of the p-residues. In thepreferred SPA mode of implementation of the present invention, all(combinations of) rotamers are applied to the p-residues. For patternsconsisting of residue pairs, this leads to a six-dimensional space. Inorder to keep the “volume” of this hyper-space within reasonable bounds,a variety of simple and more complicated criteria may be established inorder to reduce the number of patterns for residue pairs. However, suchcriteria are ignored in the present invention since the results of anyreduction method necessarily leads to a set of patterns of interest,which is the subject of the present invention.

The most important property of an ECO is its energy value which isdefined as the difference between the global energy of the referencestructure and the global energy of the same reference structure afterits modification by the steps of (1) introducing the ECO pattern intothe reference structure and (2) computing the globally optimalconformational rearrangement of the reference structure as illustratedin FIG. 2. Steps (1) and (2) will be referred to as the patternintroduction (PI) step and the structural adaptation (SA) steprespectively. Computing the ECO energy value is therefore preferablycompleted by a third step called the energy calculation (EC) stepwherein the global energy of the substituted and adapted structure iscalculated and further wherein the global energy of the referencestructure is subtracted from this value.

The PI step involves the necessary substitutions and the generation ofthe correct conformation of the p-residues, as dictated by the precisedefinition of the pattern p. This is preferably accomplished by aconventional protein modeling package such as disclosed by Delhaise etal. (cited supra) comprising practical tools in order to perform aminoacid substitutions and conformational manipulations. Once the pattern pis introduced, the amino acid type(s) and conformation(s) of thesubstituted residues are frozen throughout the process leading to theECO energy value computation. The PI step can be seen as a specificperturbation of the reference structure, leaving this structure in asub-optimal state. Indeed, the introduced residues in their fixedconformation may very well be in steric conflict with surrounding aminoacids.

In the SA step, the latter are given a chance to rearrange and find anew relaxed conformation in which steric constraints are alleviated andintramolecular interactions are improved. According to the presentinvention, this rearrangement is energy-based, i.e. the cost function tobe optimized during the SA step is the energy of the global structure asdefined in the section ENERGY FUNCTION. In a preferred embodiment of theinvention, especially the rotameric embodiment described in the sectionCONFORMATIONAL FREEDOM, all atoms of the pattern residues and themain-chain of the reference structure are held fixed throughout the SAstep. In a less preferred embodiment, the pattern atoms remain fixed butthe main-chain atoms may slightly (i.e. by about 1 to 2 Å) change inposition during the SA step. In another embodiment, and only for thenon-rotameric approach, also the pattern atoms may change in positionduring the adaptation step; this embodiment is least preferred since theintroduced pattern then looses its original structure and accounts, atleast partially, for the structural relaxation effects.

The SA step itself may be performed by any method suitable for proteinstructure prediction (see STRUCTURE OPTIMIZATION) on the basis of anobjective energetic scoring function (see ENERGY FUNCTION). The goal ofthe SA step is to find the energetically best possible substituted andadapted structure. In this step, the substituted reference structure issubmitted to a structural optimization process exploring theconformational space of the protein system, taking into account (i) therestraints imposed onto the backbone and pattern atoms (which arepreferably fixed), (ii) the allowed conformational freedom (e.g. asdetermined by a rotamer library) and (iii) the applied energy function.The specific procedure used in order to perform the said structuraloptimization is not critical to the present invention. Methods which maybe used for this purpose are described in the section STRUCTUREOPTIMIZATION.

In the EC step, the ECO energy value is calculated as a differentialvalue, i.e. the energy of the substituted and adapted structure obtainedfrom the SA step, minus the energy of the reference structure. Bothenergies can be readily obtained by a standard energy calculation methodusing the relevant energy function in order to compute all relevantintra-molecular interactions and providing a quantitative estimate forthe free or potential energy of a protein structure. Therefore the ECOenergy value obtained by substracting both energies is a quantitativemeasure for the energetic cost to mutate the pattern p into thereference structure. As far as the computed energies correspond with thetrue thermodynamic free energies of the original (WT) and mutatedproteins, they will reflect the change in thermodynamic stability of themutated protein. Consequently, the ECO energy value can also beconsidered as a fair estimate of the structural compatibility of thepattern p within the framework of the protein of interest. The ECOenergy value can be either positive (in which case the p-residues areexpected to be less compatible with the protein than the WT residues) ornegative (in which case the corresponding pattern is expected tostabilize the protein molecule).

Apart from the ECO pattern and the associated energy value, an ECO mayalso include other relevant information, including data referring to theorigin of the ECO such as (i) a reference (e.g. a PDB-code) to the (e.g.experimentally determined) 3D structure of the protein used in the ECOgeneration, (ii) a reference to the rotamer library used to model thereference structure and the adapted structure, (iii) a reference to theapplied energy function, (iv) a reference to the method used in theadaptation step, (v) a description of the reference sequence andstructure, and (vi) the transformations (as discussed below) which maybe applied to the native ECO. The latter data are important for acorrect interpretation of the ECO energy. Moreover, they provide a largeamount of added value:

-   -   first, since an ECO is self-describing, it should be possible to        verify or reconstruct its energy value.    -   secondly, the information related to energetic transformations        provide the user with a full autonomy to restore or re-transform        these data (e.g. to express the data in normalized units) if        needed in a particular application.    -   thirdly, this information considerably facilitates all forms of        data mining. For example, it makes it easy to compare sets of        ECO's constructed with different rotamer libraries, homologous        backbone structures, energy models, adaptation methods or        reference sequences.

In addition to the aforementioned data, other relevant thoughnon-essential informations may be stored in the ECO object. Examples ofsuch informations include, but are not limited to, (i) a description ofthe observed secondary structure for each pattern position, (ii) atomiccoordinates for some or all atoms of the pattern residues, e.g. atomiccoordinates for the Cα atoms, which may be helpful in identifyingneighboring/distant residues without loading a full 3D-structure, (iii)a description of the solvent accessibility (e.g. the accessible surfacearea) of each p-residue. While not essential for the assessment of theECO energy value, the storage of such additional data may be valuablewhen using ECO's for specialized applications like high-throughput foldrecognition or protein design. A more extensive list of additional ECOelements is provided in the section FORMAL DESCRIPTION OF ECO'S.

For the sake of compactness, data elements in an ECO object which referto the protein structure, residue types, rotameric states, and so on arepreferably encoded as indices pointing to data present in externallists, instead of explicitly including the raw data. Also, ECO's sharingcommon properties (e.g. ECO's generated for the same protein structure,using the same rotamer library and energy function) may be grouped intoa data structure (e.g. a file, an array or a database), the sharedproperties being assigned to the data structure rather than included ineach ECO object. However, this kind of data storage and encoding ispurely a mode of implementation, not an essential aspect of the presentinvention.

Many further transformations of the native ECO energy values obtainedfrom the EC step may be useful within the context of specificapplications of the present invention. Examples of such transformationsinclude, without limitation, (i) the expression of ECO energiesrelatively to the lowest value of all ECO's known for different patternslocated at the same residue positions, so that each transformed ECOenergy becomes positive, (ii) transforming ECO energies by means of alinear, logarithmic or exponential function, (iii) truncating everytransformed or non-transformed ECO energy above a given threshold valueto that threshold value, (iv) the expression of ECO's in normalizedunits, and so on.

Another type of transformation, i.e. the grouping of native ECO's intogrouped ECO's (referred as GECO's), does form part of the presentinvention. This grouping is applicable to the SPA mode wherein amultitude of ECO's are generated for identical positions in a protein.In a preferred embodiment of this invention, ECO's sharing identicalpositions in the protein of interest may be grouped into GECO's in orderto (i) reduce the amount of information and (ii) extract the mostsignificant information. The importance of ECO grouping as well as somepreferred methods to generate GECO's is discussed in the section GROUPEDEGO'S.

Formal Description of ECO'S

In order to formally describe ECO's and to use ECO energies for singleand pairs of residues in computing the energy of multiply substitutedproteins, the following notations are used (see also LIST OF SYMBOLICNOTATIONS).

-   -   N: the set of residue positions of interest within the protein        of interest. Example 1: N={all amino acid residues of the        protein]. Example 2: N={all fully buried amino acid residues of        the protein}. ‘i: a specific residue position of interest i,        where i ∈ N. Example: i=1 denotes the first residue position of        N.    -   A(i): the set of residue types of interest at position i. In the        SPA mode of the present invention, this set may contain multiple        residue types. Example 1: A(1)={all natural amino acid residue        types}. Example 2: A(1)={Gly, Ala, Ser}.    -   i¹: a specific residue type of interest a at position i, where        i¹ ∈A(i). Example: i^(a)=1² denotes the second residue type of        A(1), at the first residue position.    -   R(i^(a)): the set of rotamers of interest for residue type a at        position i. Rotamers of interest may be retrieved from either a        backbone dependent or independent library. Example 1:        R(i^(a))=R(a)=the set of rotamers known for residue type a in        the rotamer library described by De Maeyer et al. (cited supra).        Example 2: R(i^(a))=R(a)\{rotamers which are        backbone-constrained at position Example 3: the set of rotamers        R(i^(a)) where a refers to a Gly residue type may contain either        no element or one dummy element since Gly has no rotatable side        chain.    -   i_(r) ^(a): a specific rotamer of interest r for residue type a        at position i, where i_(r) ^(a) ∈ R(i^(a)). A specific instance        of i_(r) ^(a) forms a uniquely defined single residue pattern.        Example: i_(r) ^(a)=1₃ ² denotes the pattern formed by the third        rotamer of R(1²), for the second residue type of A(1), at the        first residue position.    -   [i,j]: the pair of specific residue positions of interest i and        j.    -   [i^(a),j^(b)]: the pair of residue positions i and j, where        residue types a and b are placed at positions i and j,        respectively (in an undefined conformation).    -   [i_(r) ^(a),j_(s) ^(b)]: the pair of residue positions i and j,        where residue types a and b are placed at positions i and j,        respectively, and residue types a and b adopt rotameric states r        and s, respectively. A specific instance of [i_(r) ^(a),j_(s)        ^(b)] forms a uniquely defined pair residue pattern.    -   [i,j,k, . . . ]: the multiplet of residue positions of interest        i,j,k, . . .    -   [i^(a),j^(b),k^(c), . . . ]: the multiplet of residue positions        i,j,k, . . . where residue types a, b, c, . . . are placed at        positions i,j,k, . . . , respectively (in an undefined        conformation).    -   [i_(r) ^(a),j_(s) ^(b),k_(t) ^(c), . . . ]: the multiplet of        residue positions i,j,k, . . . where residue types a, b, c, . .        . are placed at positions i,j,k, . . . , respectively, and        residue types a, b, c, . . . adopt rotameric states r, s, t, . .        . , respectively. A specific instance of [i_(r) ^(a),j_(s)        ^(b),k_(t) ^(c), . . . ] forms a uniquely defined multi-residue        pattern.    -   P: the complete set of patterns of interest when generating        ECO's in the SPA mode.    -   P(i): the set of single residue patterns of interest at position        i, where P(i)⊂P. By analogy, P([i,j]) and, in general, P([i,j,k,        . . . ]) is defined for pair and multiple residue patterns.    -   P(i^(a)): the set of single residue patterns for residue type a        at position i, where P(i^(a))⊂P(i)⊂P. By analogy,        P([i^(a),j^(b)]) and, in general, P([i^(a),j^(b),k^(c), . . . ])        is defined for pair and multiple residue patterns having defined        residue types.    -   ref: the reference structure of the protein of interest.        Preferably the reference structure is the energetically best        possible global structure for the protein of interest. In the        preferred embodiment of the invention wherein rotamers are used        to describe the allowed conformational states of residue types,        the reference structure can be written as a special pattern        [i_(g) ^(r),j_(g) ^(r),k_(g) ^(r), . . . ]. In this pattern, all        N residues i,j,k, . . . of the protein (excluding those of the        template) assume the reference amino acid type as defined in the        reference sequence (referred to as the “reference type” r), and        a particular rotameric state as obtained from a structure        prediction method such as a DEE method (referred to as the        “global minimum energy rotamer” g). Note that r and g are not        variables and are therefore not written in italic. Less        preferably, the reference structure may be any energetically        relaxed, e.g. energy minimized by a steepest descent or        conjugated gradient method, structure. Least preferably, the        reference structure may be any non-optimized experimental or        theoretical structure for the protein of interest or,        alternatively, the reference structure may be void and may be        assigned a reference energy E_(ref)=0.    -   E_(ref): the global energy of the reference structure. In the        preferred embodiment using rotameric descriptions, a fixed        template and a pair-wise summable energy function, E_(ref) can        be conveniently written as

E _(ref) =E _(template)+Σ_(i) E _(t)(i _(g) ^(r))+Σ_(i)Σ_(j>i) E _(p)(i_(g) _(r) ,j _(g) ^(r))   (eq. 4).

-   -   Here, E_(template) is the template self-energy, i.e. the sum of        all atomic interactions within the template. The superscripts r        refer to the reference amino acid type and the subscripts g        denote the reference rotamer (the global minimum energy        rotamer). E_(t)(i_(g) ^(r)) is the direct interaction energy of        i_(g) ^(t) with the template, including its self-energy (the        subscript t refers to the template). The self-energy may include        a solvent correction term such as a TTS term as discussed in the        section ENERGY FUNCTION. E_(p)(i_(g) ^(r),j_(g) ^(r)) denotes        the pair-wise residue/residue interaction energy, also called        rotamer/rotamer pair-energy between i_(g) ^(r) and j_(g) ^(r).        Likewise, pair-energies may include a solvent correction term as        discussed in the section ENERGY FUNCTION. Optionally, the term        E_(template) can be omitted in equation (4) since (i) it is a        constant and (ii) E_(ref) is to be used in combination with        E(i_(r) ^(a)), discussed below, wherein the same term can be        discarded (in such case, E_(ref) is not a global but a partial        energy). A great advantage of equation (4) is that the E_(t) and        E_(p) terms can be pre-computed for all possible instances of        single and pair residue patterns. These values can then        repeatedly be used in the search for the optimal reference        structure as well as in the computation of ECO energy values. In        a less preferred embodiments wherein no rotamers are used and/or        the template is not fixed and/or a non-pair-wise summable energy        model is applied, the reference energy E_(ref) must be        calculated in accordance with the characteristics of the applied        energy model, and typically as a sum over all possible        inter-atomic interactions within the reference structure. In the        least preferred embodiment, where no reference structure is        considered, E_(ref) may be set equal to 0 or any arbitrary        value, with the consequence that ECO energies can no longer be        interpreted as “substitution energies”. However, such        embodiments may still be useful in specific applications wherein        the absolute values of ECO energies are irrelevant, e.g. when        different mutations are compared one with another.    -   E(i_(r) ^(a)): the global energy for the reference structure        which has been modified by (1) a PI step wherein the single        residue pattern i_(r) ^(a) is mutated into the reference        structure. and (2) a SA step as described in the section        DESCRIPTION OF THE ECO CONCEPT. This is not a differential but        an absolute energy for a complete molecule. In the preferred        embodiment using rotamers, a fixed template and a pair-wise        energy function, this energy can be calculated in accordance        with following equation (5), wherein the terms applicable to        residue i of the single residue pattern i_(r) ^(a) are replaced        by the appropriate energy contributions:

E(i _(r) ^(a))=E _(template) +E _(t)(i _(r) ^(a))+Σ_(j) E _(p)(_(r) ^(a),j _(g) ^(r))+Σ_(j) E _(t)(j _(g) ^(r))+Σ_(j)Σ_(k>j) E _(p)(j _(g) ^(r),k _(g) ^(r),O; j,k≠I   (eq. 5).

-   -   In this equation, the residue positions j,k≠i may have quit        their reference rotameric state r as a result of the SA step,        but not their reference type g: Adaptation occurs by        conformational rearrangements wherein type switches are not        allowed. In the optional case that E_(template) has been omitted        in the calculation of E_(ref), the same term should be omitted        in equation (5) as well and E(i_(r) ^(a)) is not a global but a        partial energy. In a less preferred embodiment wherein no        rotamers are used and/or the template is not fixed and/or a        non-pair-wise summable energy function is applied, the value for        E(i_(r) ^(a)) must be calculated similarly to the reference        energy E_(ref), typically as a sum over all possible        inter-atomic interactions within the substituted and adapted        structure.    -   E_(ECO)(i_(r) ^(a)): the ECO energy value for the single residue        pattern i_(r) ^(a). As previously defined, it is the energy of        the reference structure which has been substituted by and        conformationally adapted to the pattern i_(r) ^(a), E(i_(r)        ^(a)), minus the energy of the reference structure:

E _(ECO)(i _(r) ^(a))=E(i _(r) ^(a))−E _(ref)   (eq.6).

-   -   This definition is valid for all embodiments of the present        invention. By analogy, it is also possible to define ECO        energies for pair patterns. In the preferred embodiment using        rotamers, a fixed template and a pair-wise energy function, the        global energy of the reference structure substituted by and        adapted to a pair residue pattern [i_(r) ^(a),j_(s) ^(b)] can be        written as follows:

E([i _(r) ^(a) ,j _(s) ^(b)])=E _(template) +E _(t)(i _(r) ^(a))+E_(t)(j _(s) ^(b))+E _(p)(i _(r) ^(a) ,j _(s) ^(r))+Σ_(k) E _(p)(_(r)^(a) ,k _(t) ^(r))+Σ_(k) E _(p)(j _(s) ^(b) ,k _(t) ^(r))+Σ_(k) E _(t)(k_(t) ^(r))+Σ_(k)Σ_(t>k) E _(p)(k _(t) ^(r) ,l _(u) ^(t)); ≠j; k,l≠i,j  (eq. 7)

-   -   The ECO energy for a pair residue pattern is then

E _(ECO)([i _(r) ^(a) ,j _(s) ^(b)])=E([i _(r) ^(a) ,j _(s) ^(b)])=E_(ref)   (eq.8)

In the less preferred embodiments wherein no rotamers are used and/orthe template is not fixed and/or a non-pair-wise summable energyfunction is applied, the value for E([i_(r) ^(a),j_(s) ^(b)]) must becalculated in accordance with the characteristics of the applied energymodel, and typically as a sum over all possible interatomic interactionswithin the substituted and adapted structure. In the general case ofmulti-residue patterns, the calculation of ECO energies occurs as forsingle and pair residue patterns. Here, only the equation for the ECOenergy of a multi-residue pattern [i_(r) ^(a),j_(s) ^(b),k_(t) ^(c), . .. ] is given:

E _(ECO)([r _(r) ^(a) ,j _(s) ^(b) ,k _(t) ^(c), . . . ])=E([i _(r) ^(a),j _(s) ^(b) ,k _(t) ^(c), . . . ])−E _(ref)   (eq. 9)

In order to complete the above formal description of ECO's, a moreconcrete description of the term is now provided. An energeticcompatibility object is a structured data set comprising elements orpieces of information. Here, a distinction may be made between (i) basicelements and (ii) optional elements, depending on whether they areessential for a correct execution of an ECO-based application or whetherthey may simply be helpful in such execution. Clearly, this distinctiondepends on the specific application of interest to a user.

The basic elements of an ECO are those which together form a fulldescription of the origin and characteristics of a substituted andadapted protein structure, i.e. in particular:

-   I. a structural description of the protein of interest or its    PDB-code.-   II. a description of the applied method for structural    pre-refinement of I, if any.-   III. a description of the applied rotamer library, if any. In the    preferred embodiment of the present invention, this corresponds to    the assignment of a set of rotamers R(r) to each position i defined    in (V). For positions i being pattern residues, a refers to a    pattern residue type, whereas for non-pattern positions of interest,    a refers to the residue type as defined in (VI).-   IV. a description of the applied energy function.-   V. a description of the set N, i.e. the set of residue positions i    of interest. This set consists of positions in the protein at which    a pattern p is placed (pattern positions), as well as positions    which are considered in the SA step (non-pattern positions). The    complement of N is defined as the template which, in the preferred    embodiment of this invention, is fixed.-   VI. a description of the amino acid reference sequence, i.e. the    assignment of one specific “reference amino acid type” r to each    position i defined in (V).-   VII. the method applied to compute the reference structure for (VI),    starting from (I) or (II), and using (UI) and (IV) (e.g. a DEE    method).-   VIII. a description of the reference structure for the residue    positions and types defined in (V) and (VI) respectively and    obtained by the method defined in (VII), e.g. the assignment of a    “global optimum” rotamer g to each position defined in (V).-   IX. the pattern residue position(s), i.e. [i,j,k, . . . ]-   X. the pattern residue type(s), i.e. [i^(a),j^(b),k^(c), . . . ]-   XI. the pattern residue conformations, i.e. [i_(r) ^(a),j_(s)    ^(b),k_(t) ^(c), . . . ] (this may be optional for grouped ECO's or    in the embodiments not using rotamers).-   XII. the method applied to perform the SA step for the reference    structure defined in (VIII), wherein the pattern defined in    (IX), (X) and (XI) is introduced, preferably being the same method    as used in (VII).-   XIII. the ECO energy value E_(ECO)([i_(r) ^(a),j_(s) ^(b),k_(t)    ^(c), . . . ]) as defined by equation (9) or, alternatively,    equations (6) and (8) for single and pair residue patterns,    respectively.

Optional elements of an ECO are in particular:

-   XIV. a specification of the residues which have changed their    conformation (e.g. by adopting a rotameric state different from g)    during the SA step, as well as a description of the actual changes    such as, preferably, the new rotamers.-   XV for grouped ECO's, a reference to the grouping method, as well as    a comprehensive description of the results of the grouping process.-   XVI. one or more functional properties of the protein of interest.-   XVII. a local secondary structure description of the pattern residue    positions [i,j,k, . . . ] within the protein structure as in (I).-   XVIII. a description of the solvent accessibility of the pattern    residues within the reference structure (i.e. [i_(g) ^(r),j_(g)    ^(r),k_(g) ^(r), . . . ] and/or within the substituted and adapted    structure (i.e. [i_(r) ^(a),j_(s) ^(b),k_(t) ^(c), . . . ]).-   XIX. cartesian coordinates of one or more atoms of the pattern    residues within the reference structure and/or within the    substituted and adapted structure.-   XX. crystallographic temperature factors associated with the (atoms    of) the pattern residues within the protein structure from which the    reference structure has been deduced.-   XXI. known chemical modifications associated with the pattern    residues within the protein structure from which the reference    structure has been deduced.-   XXII. physico-chemical properties associated with the pattern    residue types. Clearly, this list can be enlarged with any    additional property of interest to the user.

Systematical Pattern Analysis (SPA) Mode

ECO's may be generated by the present method at the time they areneeded, for example, to search locations in a protein where tryptophansubstitutions are allowed. Dozens of questions of this type may beanswered by one or more real-time, case-specific pattern analysisexperiments. However, some structure-related questions can only beresolved by the present method if vast amounts of ECO data areavailable. Moreover, the same ECO information may be used recurrentlywithin the context of different protein design experiments. Also,specific design and recognition tasks may be performed in the absence ofany explicit 3D-representation of a protein. Finally, pre-computed sets(or databases) of ECO's may be data mined to gain structure- andsequence-related insights, again creating added value. Therefore, thereis a need to generate ECO's in a systematic way, called the“systematical pattern analysis” (SPA) mode of operation.

The notations, conventions and abbreviations used below conform to thoselisted in the section FORMAL DESCRIPTION OF ECO'S.

In the SPA mode of the present invention, wherein only the rotamericembodiment is relevant, ECO's are systematically generated for apre-defined set P of patterns being substituted into the referencestructure of a protein of interest. Also, a distinction needs to be madebetween single and double residue ECO's. Multi-residue ECO's are notconsidered in the SPA mode only because their systematical computationmay be too slow.

For single ECO's, the set P is constructed as follows. First, a set I ofresidue positions i is defined, where I is a subset of N (the positionsof interest, actually all modeled residues). At each position i ∈ I, aset A(i) of residue types of interest is defined. A residue type aplaced at position i is noted as i^(a), where i^(a) ∈ A(i) and wherei^(a) can be interpreted as a single residue pattern in an as yetundefined conformation. It is noteworthy that at different positions i,different sets of types A(i) may be considered. Next, for each definedposition/type combination i^(a), a set R(i^(a)) comprising rotamers ofinterest for type a at position i is established, preferably retrievedfrom a rotamer library. A rotamer r, generated for type a, placed atposition i is noted as i_(r) ^(a), where i_(r) ^(a) ∈ R(i^(a)) and wherei_(r) ^(a) can be interpreted as an unambiguous single residue pattern p∈ P.

For double ECO's, basically the same approach is followed, but allcombinations of types and rotamers need to be considered for allpossible pairs of positions. More specifically, define by i, j ∈ I apair of residue positions, by i^(a) ∈ A(i) and, a A(j) a pair of residuetypes at positions i and j, respectively, and by i_(r) ^(a) ∈R(i^(a))and j_(s) ^(b) ∈ R(j^(b)) a pair of rotamers for types a and b atpositions i and j, respectively, then the set P consists of all possibleunique and unambiguous pair-residue patterns [i_(r) ¹,j_(s) ^(b)], wherei<j.

In the SPA mode, the reference structure for the protein of interestneeds to be generated only once (see REFERENCE STRUCTURE) and theassociated reference energy E_(ref) can be used in the calculation ofthe ECO energy value for all patterns of the set P. Then all patterns ofP are substituted into the reference structure, the structure is adaptedto the pattern and the total energy is calculated using equation (5) or(7) for single or double-residue patterns, respectively. This yieldsnecessary and sufficient data to calculate the ECO energy associatedwith each single and/or double residue pattern, using eqs. (6) and/or(8), respectively.

Finally, the generated data are stored on a suitable storage device(preferably into an array, a file or a database) where typically oneobject per pattern is created; each object contains (some of) the dataand references described in the section FORMAL DESCRIPTION OF ECO'S.Post-analysis operations including transformations (see DESCRIPTION OFTHE ECO CONCEPT) or grouping operations (see GROUPED ECO'S) may then beapplied.

Grouped ECO'S (GECO'S)

Besides energetic transformations (see DESCRIPTION OF THE ECO CONCEPT),the most important post-analysis method, applicable to ECO's generatedin the SPA-mode of the present invention, is the grouping of ECO's intoGECO's. While there is a nearly infinite variety of possible groupingoperations, we shall specifically address GECO's which are preferred inview of their practical usefulness and ease of interpretation i.e. (i)single GECO's, (ii) double GECO's and (iii) GECO's resulting from agrouping operation performed on the basis of residue type properties.

A single GECO, denoted G(i), is a GECO derived from an ensemble ofsingle residue ECO's, generated in SPA-mode, for a specific amino acidresidue type a placed at position i, and a set of rotamers R(i^(a)),i.e. the patterns of this ensemble only differ in rotameric state. EachECO of this ensemble can be seen as the result of a trial experimentwherein the compatibility of its associated pattern, i.e. a particularresidue type a at position i in rotameric state r, is investigatedwithin the context of an adaptive reference structure. A clear measurefor the said compatibility is the energy of each ECO of this ensemble,E_(ECO)(i_(r) ^(a)). In a preferred embodiment, the set of E_(ECO)(i_(r)^(a)) values for a given residue position and type i^(a) and allrotamers r n R(i^(a)) can simply be grouped by searching over allrotameric states for the minimum value for E_(ECO)(i_(r) ^(a)) accordingto equation (10):

E _(GECO)(i ^(a))=min_(r) E _(ECO)(i_(r) ^(a))   (eq.10)

wherein E_(ECO)(i_(a)) is the energy value resulting from the minimumsearching operation on all E_(ECO)(i_(r) ^(a)) values. If desired, themin-rotamer r_(min) can replace the pattern rotamer element in a GECOobject (element (XI) in the FORMAL DESCRIPTION OF ECO'S) but this isonly optional since a GECO is primarily intended to assess thecompatibility of an amino acid type at a specific position, rather thanthat of a conformation. A numerical illustration of such groupingoperation is shown in the following table 2.

Another preferred method to group ECO's is by taking the average over aset of low-energy values, wherein the latter set includes all values forE_(ECO)(i_(r) ^(a)) below a suitably chosen threshold value T, accordingto equation (11).

E _(GECO)(i^(a))=aver_(r)(E _(ECO)(i_(r) ^(a))|E _(ECO)(i_(r) ^(a))≦T  (eq.11)

wherein aver_(r) is the averaging operator working on E_(ECO)(i_(r)^(a)) values below T for all rotamers r at i^(a). When applying equation(11), some residue types a may not be compatible with position i, i.e.if no E_(ECO)(i_(r) ^(a)) value exists below T. When applying thisgrouping method, it is impossible to assign a specific rotamer r to theresulting GECO, but this is irrelevant if the grouping process is onlyfor the purpose of assessing the compatibility of a amino acid types.

Another preferred grouping method is based on using a probabilisticaveraging function such as shown in equation (12):

E _(GECO)(i ^(a))=Σ_(r) E _(ECO)(i_(r) ^(a))p(i_(r) ^(a))   (eq.12)

wherein

p(hd r ^(a))=exp(−βE _(ECO)(i _(r) ^(a)))/Z(i _(r) ^(a))   (eq.13)

and wherein β is a suitable constant (e.g. β=1/(k_(β)T) wherein k_(B) isBoltzmann's constant and T is the temperature in degrees Kelvin) andZ(i_(r) ^(a)) is the partition function defined by equation (14) as:

Z(i _(r) ^(a))=Σ_(r) exp(−βE _(ECO)(_(r) ^(a))   (eq.14)

This grouping method is highly preferred since it is statisticallysupported and requires no absolute energetic threshold value.

A double GECO, denoted G([i^(a),j^(b)]), is a grouped ECO derived froman ensemble of double residue ECO's, generated in SPA-mode, for pairresidue patterns implying [i^(a),j^(b)]. G([i^(a),j^(b)]) which providesinformation regarding the compatibility of the pair of residue types aand b, at positions i and j, respectively, without necessarilyspecifying their rotameric states. The possibilities to group doubleECO's for pair residue patterns [i_(r) ^(a),j_(s) ^(b)] are analogous tothose for grouping single ECO's. In a first embodiment, the minimumenergy value for all pair-wise combinations of rotamers r and s can besearched according to equation (15):

E _(GECO)([i _(r) ^(a) ,j _(s) ^(b)])=min_(r,s) E _(ECO)([i _(r) ^(a) ,j_(s) ^(b)])   (eq.15)

wherein min_(r,s) is the minimum searching operator.

A second embodiment involves averaging combinations having a pair energybelow a given threshold, following equation (16):

E _(GECO)([i _(r) ^(a) ,j _(s) ^(b))])=aver_(r,s)(E _(ECO)([i _(r) ^(a),j _(s) ^(b)])|E _(ECO)([i _(r) ^(a) ,j _(s) ^(b)])≦T)   (eq. 16)

wherein aver_(r) is the averaging operator.

A third embodiment is a statistical averaging according to equation(17):

E _(GECO)([i _(r) ^(a) ,j _(s) ^(b)])=Σ_(r,s) E _(ECO)([i _(r) ^(a) ,j_(s) ^(b)])p([i _(r) ^(a) ,j _(s) ^(b)])   (eq.17)

wherein

p([i _(r) ^(a) ,j _(s) ^(b)])=exp(−βE _(ECO)([i _(r) ^(a) ,j _(s)^(b)]))/Z([i _(r) ^(a) , _(s) ^(b)])   (eq.18)

and the partition function Z([i^(r) ^(a),j_(s) ^(b)]) is

Z([i _(r) ^(a) , _(s) ^(b)])=Σ_(r,s) exp(−βE _(ECO)([i _(r) ^(a) ,j _(s)^(b)]))   (eq. 19)

Another type of grouping is at the level of amino acid features orproperties rather than rotamers, which is useful in assessing thecompatibility of a certain type of amino acid, e.g. an aromatic aminoacid, at residue position i. Indeed, GECO's can be constructed bygrouping amino acids sharing some feature or property of interest. Forexample, G(i^(Pol)) where Pol={S,T,N,D,E,Q,R,K} denotes a GECO for apolar amino acid at position i. The energy value associated with such aGECO can be suitably defined by equation (20) as:

E _(GECO)(i ^(Kind))=aver_(a∈Kind) E _(GECO)(i ^(a))   (eq.20)

Such a grouping is preferably performed on GECO's already grouped bytype (over rotameric states). Other potentially useful groupingoperations include, but are not limited to, sets of:

-   aromatic amino acid types where Kind={H,F,Y,W},-   small residue types where Kind={G,A} or {G,A,S},-   β-branched types where Kind={I,V,T},    and the like.

Generally speaking, this latter type of grouping is intended for andsuitable to condense primary information into structured information ofa higher level which may be useful, for example, in fold recognition byfast screening of amino acid sequences against structure-based profiles.

Assessing Global Energies From Single And Double (G)ECO'S

The present invention also relates to a method to assess the globalfitness of a protein structure in which an arbitrary number ofsubstitutions have taken place compared to a reference amino acidsequence. More specifically, it is possible to assess the global energy,and thereby the fitness, of a multiply substituted protein structure bymaking use of single and preferably also double ECO or GECO energyvalues. The computation of this global energy may occur by simplysumming energetic terms for single residues and pairs of residues asdemonstrated hereinafter. As a consequence, the compatibility of a givenamino acid sequence with a given protein 3D structure, after properalignment, can be assessed fastly, compared to current methods usingconventional modelling techniques.

First, the assessment method of the invention establishes a mathematicalrelationship between single and double ECO energies. Then thisrelationship is further analyzed, thereby demonstrating how double ECOenergies can be derived from single ECO energies. Next, this method isextended so as to compute the energy of multi-residue patterns fromsingle ECO energies. Finally, it is discussed how single and doubleGECO's can be used to assess the global energy of a multiply substitutedprotein structure.

In the preferred embodiment of the present invention, i.e. when usingrotamers, a fixed template and a pair-wise energy function, the globalenergy of a reference amino acid sequence in its reference structure isgiven by equation (4). When isolating, in this equation, the energeticcontributions for two residue positions of interest i and j, we obtainequation (21):

$\begin{matrix}{{{E_{ref} = {E_{template} + {\Sigma_{k}{E_{t}\left( k_{g}^{r} \right)}} + {\Sigma_{k}\Sigma_{i > k}{E_{p}\left( {k_{g}^{r},l_{g}^{r}} \right)}} + {E_{t}\left( i_{g}^{r} \right)} + {E_{t}\left( j_{g}^{r} \right)} + {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)} + {\Sigma_{k}{E_{p}\left( {i_{g}^{r},k_{g}^{r}} \right)}} + {\Sigma_{k}{E_{p}\left( {j_{g}^{r},k_{g}^{r}} \right)}}}};{i \neq j};k},{l \neq i},j} & \left( {{eq}.\mspace{11mu} 21} \right)\end{matrix}$

wherein all terms and symbols have the same meaning as in equation (4).

A similar equation for E(i_(r) ^(a)) can be derived from equation (5),wherein the contributions for the residue positions i and j areisolated, recalling that E(i_(r) ^(a)) is the energy for the referencestructure in which the reference rotamer i_(g) ^(r) has been replaced bythe pattern rotamer i_(r) ^(a) and where the other residues have had theopportunity to assume a new lowest energy rotamer g′ which may or maynot be different from the reference rotamer g. Thus, we can writeE(i_(r) ^(a)) as

$\begin{matrix}{{{{E\left( i_{r}^{a} \right)} = {E_{template} + {\Sigma_{k}{E_{t}\left( k_{g^{\prime}}^{r} \right)}} + {\Sigma_{k}\Sigma_{l > k}{E_{p}\left( {k_{g^{\prime}}^{r},l_{g^{\prime}}^{r}} \right)}} + {E_{t}\left( i_{r}^{a} \right)} + {E_{t}\left( j_{g^{\prime}}^{r} \right)} + {E_{p}\left( {i_{r}^{a},j_{g^{\prime}}^{r}} \right)} + {\Sigma_{k}{E_{p}\left( {i_{r}^{a},k_{g^{\prime}}^{r}} \right)}} + {\Sigma_{k}{E_{p}\left( {j_{g^{\prime}}^{r},k_{g^{\prime}}^{r}} \right)}}}};{i \neq j};k},{l \neq i},j} & \left( {{eq}.\mspace{11mu} 22} \right)\end{matrix}$

Likewise, we obtain for E(j_(s) ^(b)), i.e. the global energy of thereference structure substituted by and adapted to the pattern j_(s)^(b), wherein possible rotameric adaptations are denoted by thesubscript g″:

$\begin{matrix}{{{{E\left( j_{s}^{b} \right)} = {E_{template} + {\Sigma_{k}{E_{t}\left( k_{g^{*}}^{r} \right)}} + {\Sigma_{k}\Sigma_{l > k}{E_{p}\left( {k_{g^{*}}^{r},l_{g^{*}}^{r}} \right)}} + {E_{t}\left( i_{g^{*}}^{r} \right)} + {E_{t}\left( j_{s}^{b} \right)} + {E_{p}\left( {i_{g^{*}}^{r},j_{s}^{b}} \right)} + {\Sigma_{k}{E_{p}\left( {i_{g^{*}}^{r},k_{g^{*}}^{r}} \right)}} + {\Sigma_{k}{E_{p}\left( {j_{s}^{b},k_{g^{*}}^{r}} \right)}}}};{i \neq j};k},{l \neq i},j} & \left( {{eq}.\mspace{11mu} 23} \right)\end{matrix}$

The energy of the pair pattern [i_(r) ^(a),j_(s) ^(b)], causingrotameric adaptations denoted by the subscript g′″ is given by equation(24):

$\begin{matrix}{{{{E\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right) \right\rbrack} = {E_{template} + {\Sigma_{k}{E_{t}\left( k_{g^{''\prime}}^{r} \right)}} + {\Sigma_{k}\Sigma_{l > k}{E_{p}\left( {k_{g^{''\prime}}^{r},l_{g^{''\prime}}^{r}} \right)}} + {E_{t}\left( i_{r}^{a} \right)} + {E_{t}\left( j_{s}^{b} \right)} + {E_{p}\left( {i_{r}^{a},j_{s}^{b}} \right)} + {\Sigma_{k}{E_{p}\left( {i_{r}^{a},k_{g^{''\prime}}^{r}} \right)}} + {\Sigma_{k}{E_{p}\left( {j_{s}^{b},k_{g^{''\prime}}^{r}} \right)}}}};{i \neq j};k},{l \neq i},j} & \left( {{eq}.\mspace{11mu} 24} \right)\end{matrix}$

If the expression (21) for E_(ref) is subtracted from equations (22) to(24), we obtain the equations for the single ECO energies E_(ECO)(i_(r)^(a)) and E_(ECO)(j_(s) ^(b)) and for the double ECO energyE_(ECO)([i_(r) ^(a),j_(s) ^(b)) as follows:

$\begin{matrix}{{{{E_{ECO}\left( i_{r}^{a} \right)} = {{\Sigma_{k}\left( {{E_{t}\left( k_{g^{\prime}}^{r} \right)} - {E_{t}\left( k_{g}^{r} \right)}} \right)} + {\Sigma_{k}{\Sigma_{l > k}\left( {{E_{p}\left( {k_{g^{\prime}}^{r},l_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {k_{g}^{r},l_{g}^{r}} \right)}} \right)}} + \left( {{E_{t}\left( i_{r}^{a} \right)} - {E_{t}\left( i_{g}^{r} \right)}} \right) + \left( {{E_{t}\left( j_{g^{\prime}}^{r} \right)} - {E_{t}\left( j_{g}^{r} \right)}} \right) + \left( {{E_{p}\left( {i_{r}^{a},j_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) + {\Sigma_{k}\left( {{E_{p}\left( {i_{r}^{a},k_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {i_{g}^{r},k_{g}^{r}} \right)}} \right)} + {\Sigma_{k}\left( {{E_{p}\left( {j_{g^{\prime}}^{r},k_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {j_{g}^{r},k_{g}^{r}} \right)}} \right)}}};{i \neq j};k},{l \neq i},j} & \left( {{eq}.\mspace{11mu} 25} \right) \\{{{{E_{ECO}\left( j_{s}^{b} \right)} = {{\Sigma_{k}\left( {{E_{t}\left( k_{g^{''}}^{r} \right)} - {E_{t}\left( k_{g}^{r} \right)}} \right)} + {\Sigma_{k}{\Sigma_{l > k}\left( {{E_{p}\left( {k_{g^{''}}^{r},l_{g^{''}}^{r}} \right)} - {E_{p}\left( {k_{g}^{r},l_{g}^{r}} \right)}} \right)}} + \left( {{E_{t}\left( i_{g^{''}}^{r} \right)} - {E_{t}\left( i_{g}^{r} \right)}} \right) + \left( {{E_{t}\left( j_{s}^{b} \right)} - {E_{t}\left( j_{g}^{r} \right)}} \right) + \left( {{E_{p}\left( {i_{g^{''}}^{r},j_{s}^{b}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) + {\Sigma_{k}\left( {{E_{p}\left( {i_{r}^{a},k_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {i_{g}^{r},k_{g}^{r}} \right)}} \right)} + {\Sigma_{k}\left( {{E_{p}\left( {j_{s}^{b},k_{g^{''}}^{r}} \right)} - {E_{p}\left( {j_{g}^{r},k_{g}^{r}} \right)}} \right)}}};{i \neq j};k},{l \neq i},j} & \left( {{eq}.\mspace{11mu} 26} \right) \\{{{{E_{ECO}\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack \right)} = {{\Sigma_{k}\left( {{E_{t}\left( k_{g^{-}}^{r} \right)} - {E_{t}\left( k_{g}^{r} \right)}} \right)} + {\Sigma_{k}{\Sigma_{l > k}\left( {{E_{p}\left( {k_{g^{''\prime}}^{r},l_{g^{''\prime}}^{r}} \right)} - {E_{p}\left( {k_{g}^{r},l_{g}^{r}} \right)}} \right)}} + \left( {{E_{t}\left( i_{r}^{a} \right)} - {E_{t}\left( i_{g}^{r} \right)}} \right) + \left( {{E_{t}\left( j_{s}^{b} \right)} - {E_{t}\left( j_{g}^{r} \right)}} \right) + \left( {{E_{p}\left( {i_{r}^{a},j_{s}^{b}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) + {\Sigma_{k}\left( {{E_{p}\left( {i_{r}^{a},k_{g^{''\prime}}^{r}} \right)} - {E_{p}\left( {i_{g}^{r},k_{g}^{r}} \right)}} \right)} + {\Sigma_{k}\left( {{E_{p}\left( {j_{s}^{b},k_{g^{''\prime}}^{r}} \right)} - {E_{p}\left( {j_{g}^{r},k_{g}^{r}} \right)}} \right)}}};{i \neq j};k},{l \neq i},j} & \left( {{eq}.\mspace{11mu} 27} \right)\end{matrix}$

Each of equations (25) to (27) consists of a large number of terms whichhave been ordered specifically to make them more interpretable, asfollows: the first line in each equation contains terms related tonon-pattern residue positions k,l≠i,j only; the second line, incontrast, contains only terms related to pattern residues i,j; andfinally, the third line contains interactions between pattern positionsif and non-pattern positions k≠i,j. Moreover, all terms are grouped inpairs wherein the reference contributions “g” are subtracted from thepattern related terms “g′, g″, g′″”, i.e. each pair of terms thereforereflects the difference brought about by a pattern substitution. Bearingthis in mind, we can substitute

$\begin{matrix}{{\Delta \; {E_{self}^{k}(p)}} = \left\{ \begin{matrix}{{{\Sigma_{k}\begin{pmatrix}{{E_{t}\left( k_{g^{\prime}}^{r} \right)} -} \\{E_{t}\left( k_{g}^{r} \right)}\end{pmatrix}} + {\Sigma_{k}{\Sigma_{l > k}\left( {{E_{p}\left( {k_{g^{\prime}}^{r},l_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {k_{g}^{r},l_{g}^{r}} \right)}} \right)}}};} & {p = i_{r}^{a}} \\{{{\Sigma_{k}\begin{pmatrix}{{E_{t}\left( k_{g^{''}}^{r} \right)} -} \\{E_{t}\left( k_{g}^{r} \right)}\end{pmatrix}} + {\Sigma_{k}{\Sigma_{l > k}\left( {{E_{p}\left( {k_{g^{''}}^{r},l_{g^{''}}^{r}} \right)} - {E_{p}\left( {k_{g}^{r},l_{g}^{r}} \right)}} \right)}}};} & {p = j_{s}^{b}} \\{{{\Sigma_{k}\begin{pmatrix}{{E_{t}\left( k_{g^{''\prime}}^{r} \right)} -} \\{E_{t}\left( k_{g}^{r} \right)}\end{pmatrix}} + {\Sigma_{k}{\Sigma_{l > k}\left( {{E_{p}\left( {k_{g^{''\prime}}^{r},l_{g^{''\prime}}^{r}} \right)} - {E_{p}\left( {k_{g}^{r},l_{g}^{r}} \right)}} \right)}}};} & {p\left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack}\end{matrix} \right.} & \left( {{eq}.\mspace{11mu} 28} \right)\end{matrix}$

where ΔE_(self) ^(k)(p) is preferably interpreted as the difference inthe global internal (or “self”) energy of the non-pattern residuesk,l≠i,j (compared to the reference structure) due to conformationalrearrangements induced by pattern p, where p is either i_(r) ^(a),j_(s)^(b) or [i_(r) ^(a),j_(s) ^(b)]. Similarly, we can substitute

$\begin{matrix}{{\Delta \; {E_{int}^{{ij}\text{-}k}(p)}} = \left\{ \begin{matrix}{{{\Sigma_{k}\begin{pmatrix}{{E_{p}\left( {i_{r}^{a},k_{g^{\prime}}^{r}} \right)} -} \\{E_{p}\left( {i_{g}^{r},k_{g}^{r}} \right)}\end{pmatrix}} + {\Sigma_{k}\left( {{E_{p}\left( {j_{g^{\prime}}^{r},k_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {j_{g}^{r},k_{g}^{r}} \right)}} \right)}};} & {p = i_{r}^{a}} \\{{{\Sigma_{k}\begin{pmatrix}{{E_{p}\left( {i_{g^{''}}^{r},k_{g^{''}}^{r}} \right)} -} \\{E_{p}\left( {i_{g}^{r},k_{g}^{r}} \right)}\end{pmatrix}} + {\Sigma_{k}\left( {{E_{p}\left( {j_{s}^{b},k_{g^{''}}^{r}} \right)} - {E_{p}\left( {j_{g}^{r},k_{g}^{r}} \right)}} \right)}};} & {p = j_{s}^{b}} \\{{{\Sigma_{k}\begin{pmatrix}{{E_{p}\left( {i_{r}^{a},k_{g^{''\prime}}^{r}} \right)} -} \\{E_{p}\left( {i_{g}^{r},k_{g}^{r}} \right)}\end{pmatrix}} + {\Sigma_{k}\left( {{E_{p}\left( {j_{s}^{b},k_{g^{''\prime}}^{r}} \right)} - {E_{p}\left( {j_{g}^{r},k_{g}^{r}} \right)}} \right)}};} & {p\left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack}\end{matrix} \right.} & \left( {{ep}.\mspace{11mu} 29} \right)\end{matrix}$

where ΔEE_(int) ^(ij−k)(p) is preferably interpreted as the differencein total interaction energy between the pattern residues i,j and thenon-pattern residues k≠i,j (compared to the reference structure) due toconformational rearrangements induced by pattern p, where p is eitheri_(r) ^(a),j_(s) ^(b) or [i_(r) ^(a),j_(s) ^(b)]. When entering theformer two quantities into equations (25) to (27), we obtain

$\begin{matrix}{{E_{ECO}\left( i_{r}^{a} \right)} = {{\Delta \; {E_{self}^{k}\left( i_{r}^{a} \right)}} + \left( {{E_{t}\left( i_{r}^{a} \right)} - {E_{t}\left( i_{g}^{r} \right)}} \right) + \left( {{E_{t}\left( j_{g^{\prime}}^{r} \right)} - {E_{t}\left( j_{g}^{r} \right)}} \right) + \left( {{E_{p}\left( {i_{r}^{a},j_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) + {\Delta \; {E_{int}^{{ij}\text{-}k}\left( i_{r}^{a} \right)}}}} & \left( {{eq}.\mspace{11mu} 30} \right) \\{{E_{ECO}\left( j_{s}^{b} \right)} = {{\Delta \; {E_{self}^{k}\left( j_{s}^{b} \right)}} + \left( {{E_{t}\left( i_{g^{\prime}}^{r} \right)} - {E_{t}\left( i_{g}^{r} \right)}} \right) + \left( {{E_{t}\left( j_{s}^{b} \right)} - {E_{t}\left( j_{g}^{r} \right)}} \right) + \left( {{E_{p}\left( {i_{g^{''}}^{r},j_{s}^{b}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) + {\Delta \; {E_{int}^{{ij}\text{-}k}\left( j_{s}^{b} \right)}}}} & \left( {{eq}.\mspace{11mu} 31} \right) \\{{E_{ECO}\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack \right)} = {{\Delta \; {E_{self}^{k}\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack \right)}} + \left( {{E_{t}\left( i_{r}^{a} \right)} - {E_{t}\left( i_{g}^{r} \right)}} \right) + \left( {{E_{t}\left( j_{s}^{b} \right)} - {E_{t}\left( j_{g}^{r} \right)}} \right) + \left( {{E_{p}\left( {i_{r}^{a},j_{s}^{b}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) + {\Delta \; {E_{int}^{{ij}\text{-}k}\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack \right)}}}} & \left( {{eq}.\mspace{11mu} 32} \right)\end{matrix}$

This set of equations now allows to write a double ECO energy as afunction of single ECO energies. For this purpose, equations (30) and(31) are subtracted from equation (32).

$\begin{matrix}{{{E_{ECO}\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack \right)} - {E_{ECO}\left( i_{r}^{a} \right)} - {E_{ECO}\left( j_{s}^{b} \right)}} = {{\Delta \; {E_{self}^{k}\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack \right)}} - \left( {{\Delta \; {E_{self}^{k}\left( i_{r}^{a} \right)}} + {\Delta \; {E_{self}^{k}\left( j_{s}^{b} \right)}}} \right) + {\Delta \; {E_{int}^{{ij}\text{-}k}\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack \right)}} - \left( {{\Delta \; {E_{int}^{{ij} - k}\left( i_{j}^{a} \right)}} + {\Delta \; {E_{int}^{{ij} - k}\left( j_{s}^{b} \right)}}} \right) - \left( {{E_{t}\left( j_{g^{\prime}}^{r} \right)} - {E_{t}\left( j_{g}^{r} \right)}} \right) - \left( {{E_{t}\left( i_{g^{''}}^{r} \right)} - {E_{t}\left( i_{g}^{r} \right)}} \right) + \left( {{E_{p}\left( {i_{r}^{a},j_{s}^{b}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) - \left( {{E_{p}\left( {i_{r}^{a},j_{g}^{r}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) - \left( {{E_{p}\left( {i_{g^{''}}^{r},j_{s}^{b}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right)}} & \left( {{eq}.\mspace{11mu} 33} \right)\end{matrix}$

The terms appearing in the second and third line of equation (33) may beanalysed as follows. First, they measure the energetic consequences forthe non-pattern residues k≠i,j due to the introduction of the patternsi_(r) ^(a),j_(s) ^(b) and [i_(r) ^(a),j_(s) ^(b)]. Secondly, it can besafely assumed that introducing any single or pair residue pattern isunlikely to change the conformation of an entire protein structure,whereas the introduction of e.g. a single residue pattern i_(r) ^(a) islikely to cause adaptation events for only a limited number of residuesk in its immediate environment. In other words, it is expected that aprotein structure shows some “plasticity” behaviour with respect toperturbations of its structure (i.e. the reference structure), i.e.small patterns are likely to induce small, local changes. Taking thisexpectation for true, it can also be expected that two different singleresidue patterns i_(r) ^(a) and j_(s) ^(b) may have non-overlappingassociated sets of adapted residues. Formally defining by K(p) the setof non-pattern residues k≠i,j which adopt a different rotameric state(compared to the reference structure) after the introduction of, andadaptation to, the pattern p, for the three possible sets of patternsdefined on i and/or j we have:

K(i _(r) ^(a))≡{k|k_(g′) ^(r) ,≠k _(g) ^(r)}  (eq.34)

K(j _(s) ^(b))≡{k|k _(g′) ^(r) ≠k _(g) ^(r)}  (eq.35)

K([i _(r) ^(a) ,j _(s) ^(b)])≡{k|k _(g″) ^(r) ≠k _(g) ^(r)}  (eq.36)

The condition for non-overlapping sets of adapted residues associatedwith patterns i_(r) ^(a) and j_(s) ^(b) is then represented by equation(37):

K(i _(r) ^(a))∩K(j _(s) ^(b))=Ø  (eq.37)

A question addressed by the method of the present invention is then todefine the relationship between the sets K(i_(r) ^(a)), K(j_(s) ^(b))and K([i_(r) ^(a),j_(s) ^(b)]) and, more specifically, the conformationof the elements in each set. Ignoring all other theoreticalpossibilities, we introduce herein an hypothesis called “additivity ofadaptation” (AA) which is illustrated in FIG. 3 and assumes thatconformational adaptation effects caused by a pair pattern [i_(r)^(a),j_(s) ^(b)] are identical to the sum of all adaptation effectscaused by each of its constituting single residue patterns i_(r) ^(a)and j_(s) ^(b). A first criterion for this hypothesis to be fulfilled isthat the introduction of a pair pattern [i_(r) ^(a),j_(s) ^(b)] affectsexactly the same residue positions k≠i,j as being affected by either thesingle residue pattern i_(r) ^(a) or j^(s) ^(b), which may be formallyrepresented by equation (38):

K(i _(r) ^(a))∪K(j _(s) ^(b))=K([i _(r) ^(a) ,j _(s) ^(b)])   (eq.38)

A further requirement for AA is that all rotamers k_(g″) ^(r)≠k_(g) ^(r)induced by the pattern [i_(r) ^(a),j_(s) ^(r)] are found among therotamers induced by either the single residue pattern i_(r) ^(a) orj_(s) ^(b). This criterion can be expressed formally by the followingequation:

$\begin{matrix}\left\{ \begin{matrix}{{k_{g^{''\prime}}^{r} = k_{g^{\prime}}^{r}};} & {\forall{k{{k \in {K\left( i_{r}^{a} \right)}}}}} \\{{k_{g^{''\prime}}^{r} = k_{g^{''}}^{r}};} & {\forall{k{{k \in {K\left( j_{s}^{b} \right)}}}}}\end{matrix} \right. & \left( {{eq}.\mspace{11mu} 39} \right)\end{matrix}$

This criterion is unambiguous if the conditions set forth in equations(37) and (38) are fulfilled. Then, under the AA hypothesis, theequations (30) to (32) can be drastically simplified, e.g. by combiningequations (28) and (39):

ΔE _(self) ^(k)([_(r) ^(a) ,j _(s) ^(b)])−ΔE _(self) ^(k)(i _(r)^(a))−ΔE _(self) ^(k)(j _(s) ^(b))=0   (eq.40)

and, similarly, by combining equations (29) and (39):

ΔE _(int) ^(ij−k)([i _(r) ^(a) ,j _(s) ^(b)])−ΔE _(int) ^(ij−k)(i _(r)^(a))−ΔE _(int) ^(ij−k)(j _(s) ^(b))=0   (eq.41)

so that equation (33) simplifies to:

$\begin{matrix}{{E_{ECO}\left( \left\lbrack {i_{r}^{a},j_{s}^{b}} \right\rbrack \right)} = {{E_{ECO}\left( i_{r}^{a} \right)} + {E_{ECO}\left( j_{s}^{b} \right)} - \left( {{E_{t}\left( j_{g^{\prime}}^{r} \right)} - {E_{t}\left( j_{g}^{r} \right)}} \right) - \left( {{E_{t}\left( i_{g^{''}}^{r} \right)} - {E_{t}\left( i_{g}^{r} \right)}} \right) + \left( {{E_{p}\left( {i_{r}^{a},j_{s}^{b}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) - \left( {{E_{p}\left( {i_{r}^{a},j_{g^{\prime}}^{r}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right) - \left( {{E_{p}\left( {i_{g^{''}}^{r},j_{s}^{b}} \right)} - {E_{p}\left( {i_{g}^{r},j_{g}^{r}} \right)}} \right)}} & \left( {{eq}.\mspace{11mu} 42} \right)\end{matrix}$

25

When the AA criteria are met, equation (42) provides a means to computethe energy of a double ECO from the energies of single ECO's. Thisequation only requires terms directly involving pattern positions i andj, not terms related to non-pattern positions. The said terms involvingpattern positions i and j can be readily computed as soon as therotamers j_(g′) ^(r) (induced by i_(r) ^(a)) and i_(g″) ^(r) (induced byj_(s) ^(b) as well as the associated template (E_(t)) and pair (E_(p))energies are known and can be considered as corrections needed to derivedouble ECO energies from single ECO's. These terms can be grouped into anew function E_(fuse)(i_(r) ^(a),j_(s) ^(b)), comprising the saidenergetic corrections needed to fuse two single ECO energies into onedouble ECO energy, defined in equation (43):

E _(fuse)(i _(r) ^(a) ,j _(s) ^(b))=−E _(t)(j _(g′) ^(r))−E _(t) ⁽ i_(g″) ^(r))+E _(t)(i _(g) ^(r))+E _(t)(j _(g) ^(r))+E _(p)(i _(r) ^(a),j _(s) ^(b))−E _(p)(i _(r) ^(a) ,j _(g′) ^(r))−E _(P)(i _(g″) ^(r) ,j_(s) ^(b))+E _(p)(i _(g′) ^(r) ,j _(g) ^(r))   (eq.43)

Then, still under the AA hypothesis, equation (42) becomes:

E _(ECO)([i _(r) ^(a) ,j _(s) ^(b)])=E _(ECO)(i _(r) ^(a))+E _(ECO)(j_(s) ^(b))+E _(fuse)(i _(r) ^(a) ,j _(s) ^(b))   (eq.44)

When the AA criteria are not met as illustrated in FIG. 3 b, errors inequation (44) due to ignoring non-additivity effects can be included ina non-additivity term E_(non-AA)(i_(r) ^(a),j_(s) ^(b)) leading to auniversally applicable expression:

E _(ECO)([i _(r) ^(a) ,j _(s) ^(b)])=E _(ECO)(i _(r) ^(a))+E _(ECO)(j_(s) ^(b))+E _(fuse)(i _(r) ^(a) ,j _(s) ^(b))+E _(non-AA)(i _(r) ^(a),j _(s) ⁵)   (eq.45)

The first three terms at the right hand side of equation (45), as wellas the double ECO energy, can be exactly calculated. Therefore, equation(45) provides a practical means to statistically analyse possible errorsdue to neglecting non-additivity effects and to correlate these errorswith properties related to single residue patterns.

The present invention is also able to investigate in detail therelationship between single and double GECO's. This is of specialimportance because GECO's, being grouped over the rotameric dimension(r), are dependent only of a position (i) and type (a) and,consequently, allow for a huge reduction of data while keeping only themost relevant information. Since GECO's reflect how well residue types(not necessarily rotamers) can fit at specific positions in a proteinstructure, analysis of the relationship between single and double GECO'swould possibly make it possible to answer questions like:

-   -   If a double substitution [i^(a),j^(b)] is energetically        compatible, is each single substitution [i^(a)] and [j^(b)]        energetically compatible as well, and vice-versa?, and    -   can double GECO's be easily derived from single GECO's?

The ultimate aim of such analysis is to predict the energeticcompatibility of multiple substitutions from information related tosingle and double GECO's, thereby facilitating protein design and foldrecognition applications by allowing detailed, structure-basedpredictions without actually considering any 3D structure.

By analogy to equation (45), the following relationship between singleand double GECO energies is first assumed:

E _(GECO)([i ^(a) ,j ^(b)])=E _(GECO)(i ^(a))+E _(GECO)(j ^(b))+E_(fuse)(i ^(a) ,j ^(b))+E _(non-AA)(i ^(a) ,j ^(b))   (eq.4)

wherein E_(GECO)([i^(a),j^(b)]) is the energy of a GECO of a specialpattern p defined on residue positions i and/or j carrying types aand/or b, respectively, but where the rotameric states is are notnecessarily defined. The non-additivity term E_(non-AA)(i^(a),j^(b)) hasthe same meaning as previously, i.e. it reflects the energeticconsequences of a possible inconsistence between adaptations caused bypatterns i^(a) and j^(b) versus [i^(a),j^(b)]. Although some ambiguitymay lie in this term since the rotameric state of each of these patternsmay be undefined, except for GECO's derived by equations (10) and (15),however this term can simply be ignored since the method of the presentinvention is preferably used under the AA hypothesis. In equation (46),E_(fuse)(i^(a),j^(b)) may be considered as a correction term needed totake into account that i^(a) has been considered in the referencestructure where j had type r (and mutatis mutandis for j^(b)). Sinceboth types a and b are likely to be different from the correspondingtypes in the reference structure, this fuse-term can adopt a widevariety of values, both negative and positive. In contrast to theforegoing description, it would be in conflict with the basic GECOconcept to attempt to calculate E_(fuse)(i^(a),j^(b)) by similarity withequation (43) for the following reason. If the rotameric states of theinvolved residues are unknown or irrelevant, then it becomes impossibleor irrelevant to consider the associated rotamer/template androtamer/rotamer interaction energies as in equation (43). A much morepractical way to assign E_(fuse)(i^(a),j^(b)) values is to derive themdirectly from the single and double GECO energies according to equation(47):

E _(fuse)(i ^(a) ,j ^(b))=E _(GECO)(i ^(a) ,j ^(b))−(E _(GECO)(i^(a))+E_(GECO)(j ^(b)))   (eq.4)

which is true when there is additivity in the adaptation effects causedby the pair of single GECO's and the double GECO. However, when the AAhypothesis is not fulfilled, the energetic consequences (actually,E_(non-AA)(i^(a), j^(b))) will be incorporated into the fuse termE_(fuse)(i^(a),j^(b)). Anyhow, equation (47) can be used to computedouble GECO energies from a database containing single GECO energies andpair-wise fuse terms without effectively generating the double GECO.

Another aspect of the present invention is a method to compute theenergy of multiply substituted proteins from single and double GECOenergies. Considering that:

-   -   equation (45) means that the substitution energy of a multiplet        pattern (such as a pair) can be derived from the substitution        energy of basic patterns (i.e. single residues),    -   the present invention relates to energy functions which are        essentially pair-wise additive in the sense that the global        energy of a protein structure can be written in a form        equivalent to equation (4),    -   the fuse term in equation (45) mainly contains energetic        contributions which are directly and exclusively related to        pattern residue positions, and        the non-additivity effects are expected to be small in size        compared to the fuse term in eq (45).

Taking into account the previous elements and remarks, we postulate thatthe total substitution energy of a protein structure comprising amultiplet pattern of the type [i^(a),j^(b),k^(c), . . . ] can beapproximated by the equation

$\begin{matrix}{{E_{GECO}\left( \left\lbrack {i^{a},j^{b},k^{c},\ldots} \right\rbrack \right)} \approx {{E_{GECO}\left( i^{a} \right)} + {E_{GECO}\left( j^{b} \right)} + {E_{GECO}\left( k^{c} \right)} + \ldots + {E_{fusc}\left( {i^{a},j^{b}} \right)} + {E_{fusc}\left( {i^{a},k^{c}} \right)} + \ldots + {E_{fise}\left( {j^{b},k^{c}} \right)} + \ldots + \ldots}} & \left( {{eq}.\mspace{11mu} 48} \right)\end{matrix}$

If the pattern residue positions are not explicitly given a specificname, then equation (48) can be written in a simpler form:

E _(GECO)(p)≈Σ_(i) E _(GECO)(^(a(i)))+Σ_(i)Σ_(j>i) E _(fuse)(i ^(a(i)),j ^(b(j)))   (eq. 49)

where p is a multiplet pattern defined on any number of residuepositions of interest i and a residue type of interest a(i) is placed atposition i in an undefined rotameric state, and where E_(GECO)(i^(a(i)))is the single residue GECO energy for i^(a(i)), and whereE_(fuse)(i^(a(i)),j^(b(j)) is the fuse term for two single residuepatterns i^(a(i)) and j^(b(j)), preferably calculated from thecorresponding double GECO energy and both single GECO energies usingequation (47).

Clearly, equation (49) forms an essential aspect of the presentinvention since it indicates that the energetic compatibility of anyamino acid sequence of interest within the context of an adaptivereference structure can be assessed from only single and pair-wiseenergy terms (which are preferably stored in a database).

Potential errors in the calculation of E_(GECO)(_(P)) by means ofequation (49) compared to when p would be effectively substituted andmodeled into the reference structure, may result from different sources.However, the quantitative global error in equation (49) can be includedin an extra error term for the multiplet pattern p, E_(error)(p), sothat we obtain

E _(GECO)(p)=Σ_(i) E _(GECO)(i^(a(i)))+Σ_(i)Σ_(j>i) E _(fuse)(i ^(a(i)),j ^(b(j)))+E _(error)(p)   (eq. 50)

The latter equation provides a practical means to quantitativelyestimate the total error for a variety of different patterns p. Indeed,if E_(GECO)(p) computed with equation (49) is renamed as E_(GECO)^(approx)(p) and E_(GECO)(p) resulting from the true modeling of p inthe reference structure is referred to as E_(GECO) ^(mod)(p), then

E _(error)(p)=E _(GECO) ^(approx)(p)−E _(GECO) ^(mod)(p)   (eq. 51)

In the section FOLD RECOGNITION APPLICATION, a correlation is madebetween the approximate and modeled GECO energies for a selected proteinstructure and a wide variety of different patterns substituted into thetopologically most difficult part of the protein, i.e. the core.

Proof Of Principle

Hereinafter, the practical usefulness of the ECO concept with respect totwo important fields of application is illustrated. In a firstexperiment, a fold recognition simulation is performed by searching anear-optimal correlation between a target amino acid sequence ofinterest and a transformed-GECO profile derived for a protein structurewhich is homologous to the target sequence. In a second experiment, aprotein design experiment is carried out on three selected positions inthe B1 domain of protein G (PDB code 1PGA). More specifically, at threeselected residue positions 100 randomly chosen triple mutations weremodeled and the global energy of each modeled structure was compared tothe energy calculated on the basis of single and double GECO energies,in accordance with equation (49). From a correlation plot between thetwo sets of energy data, it follows that the global energy of a multiplysubstituted protein structure may be estimated from only single anddouble GECO energies with sufficient accuracy. Since the lattercomputations occur extremely fast compared to the effective modeling ofmultiple substitutions, the method based on single and double GECO's maybe useful to rapidly generate a set of amino acid sequences which arelikely to be compatible with a protein structure.

Fold Recognition Application

In order to exemplify the potential of the ECO concept with respect tofold recognition applications, single residue ECO's were generated forhen lysozyme (PDB code 3LZT) by applying the SPA embodiment of thepresent invention (see TABLE 1 for experimental settings and data).Next, the single residue ECO's were grouped into single residue GECO'sfor each residue position/type combination by searching the minimal ECOenergy value over all rotamers, in accordance with equation (10) (seeTable 2). The resulting set of GECO's is hereinafter referred to as theinitial GECO profile for the protein studied. The initial profile for3LZT was stored to disk so that different versions of a recognitionmethod (described below) could be tested using the same initial profile.

It has been observed that five different consecutive transformations ofGECO energies were helpful to obtain a near-optimal alignment between atarget amino acid sequence and a GECO profile. A first transformation,performed on the initial GECO profile for 3LZT, involved the translationof the distribution of GECO energies for a given residue position by thelowest value observed for that position over all residue types. Thetransformation equation used was

E _(GECO,1)(i ^(a))=E _(GECO)(i ^(a))−min_(a) E _(GECO)(i ^(a))   (eq.52)

where E_(GECO,1)(i^(a)) is the transformed GECO energy for residue typea at position i, E_(GECO)(i^(a)) is the initial GECO energy for i^(a)and min_(a) is the min-operator to search for the lowest possibleE_(GECO)(i^(a)) value at position i. As a result of this operation, alltransformed GECO energies became equal to or greater than zero. A secondtransformation included the truncation of all E_(GECO,1)(i^(a)) valuesto a maximal value of 100, as formally expressed by equation (53):

E _(GECO,2)(i ^(a))=min(E _(GECO,1)(i ^(a)),100)   (eq. 53)

where min is the min-operator acting on its two arguments, i.e.E_(GECO,1)(i^(a)) and a value of 100. As a result, strongly restrainedsubstitutions (see Table 2) received a default value of 100. A thirdtransformation was a logarithmic resealing of E_(GECO,2)(i^(a)) valuesby applying the following equation:

E _(GECO,3)(i ^(a))=log(1+E _(GECO,2)(i ^(a))×0.99)

whereby a E_(GECO,2)(i^(a)) interval [0,100] was mapped onto aE_(GECO,3)(i^(a)) interval [0,2] by a decimal logarithm function. Next,these values were converted to normalized units as follows: first, afourth transformation was performed on the distributions ofE_(GECO,3)(i^(a)) values for each position i. These distributions wereassumed to be Gaussian (which is, admittedly, an approximation) and theaverage value, aver_(a)(E_(GECO,3)(i^(a))), and standard deviation,std_(a)(E_(GECO,3)(i^(a))), were calculated for each position i over allresidue types a. Then, equation (55)

E _(GECO,4)(i ^(a))=(E _(GECO,3)(i ^(a))−aver_(a)(E_(GECO,3)(^(a))))/std_(a)(E _(GECO,3)(i ^(a)))   (eq. 55)

was performed on all E_(GECO,3)(i^(a)) values, which essentiallyrescales the transformed

GECO energies to a number of standard deviations above or below theaverage value observed for a position i. Negative E_(GECO,4)(i^(a))values may be associated with “favorable” or “better than average”substitutions whereas positive values may be seen as “unfavorable”substitutions. Next, a fifth transformation was performed on theE_(GECO,4)(i^(a)) values for the 20 residue type distributions by theequation

E _(GECO,5)(i^(a))=(E _(GECO,4)(i^(a))−aver_(i)(E_(GECO,4)(i^(a))))/std_(i)(E _(GECO,4)(i ^(a))) (eq. 56)

which rescales the E_(GECO,4)(i^(a)) values to a number of standarddeviations above or below the average value observed for a residue typea. The reason for the fourth transformation was to include a correctionfor the intrinsic level of difficulty to place different residue typesat a given position, while the fifth transformation intended to correctfor the intrinsic level of complexity to place a given residue type atdifferent positions. A set of E_(GECO,5)(i^(a)) values for all residuepositions i of a given protein structure and the 20 naturally occurringresidue types a at each position is hereinafter called a transformedGECO profile for the protein involved.

Target amino acid sequences were compared with transformed GECO profilesusing a modified Smith-Waterman (SW) alignment procedure in accordancewith R. Durbin et al. in Biological Sequence Analysis. Probabilisticmodels of proteins and nucleic acids (1998) at Cambridge UniversityPress. The modification consisted in that no gap formation was allowed,which is equivalent to assigning an infinite gap penalty value. Sincethe SW approach requires that (i) a favorable alignment of a residueagainst a given position in a profile receives a positive score value(whereas favorable E_(GECO,5)(i^(a)) values are negative) and (ii) theglobal expected value of score contributions should be negative (whereasthe global expected value for E_(GECO,5)(i^(a)) values is zero as aresult of the normalization operations), two additional transformationsof the E_(GECO,5)(i^(a)) values were needed. First, the sign of theE_(GECO,5)(i^(a)) values was inverted to meet the first requirement:

E _(GECO,6)(i^(a))=−E _(GECO,5)(i ^(a))   (eq. 57)

In addition, during construction of a SW alignment matrix, a value of0.5 was subtracted from the E_(GECO,6)(i^(a)) values to meet the secondrequirement. This means that a residue only contributes in a favorableway to an alignment if its E_(GECO,6)(i^(a)) value is greater than 0.5(i.e. if the log-transformed E_(GECO,3)(i^(a)) value is at least 0.5standard deviations below the average value at position i). However,since this transformation was only intended to identify high-scoringsequence fragments on the basis of maxima in the alignment matrix, inaccordance with the SW method, the non-transformed E_(GECO,6)(i^(a))values were used in all further calculations. From the SW alignmentmatrix, the 50 highest-scoring fragments were selected, beingcharacterized by (i) an offset in the target sequence, (ii) an offset inthe transformed GECO profile, (iii) a length l and (iv) a cumulativealignment score z being the sum of E_(GECO,6)(i^(a)) contributions forpositions i in the profile and residue types a observed in the targetsequence.

The obtained fragments were further statistically analyzed bycalculating the probability that they were found by pure chance. Theprobability P(l,z) that a randomly aligned fragment of a given length land consisting of uncorrelated, randomly selected residue types has analignment score equal to or better than a given score z can be derivedas follows. If two variables x₁ and x₂ are independent normal variableswith an average value of 0 and a standard deviation of 1, here writtenas N(0,1), then their sum is distributed as N(0,√{square root over(2)}), i.e. a distribution centered around 0, with a standard deviationequal to √{square root over (2)}, according to Neuts in Probability(1973) at Allyn and Bacon, Inc. (Boston). By extension, the sum of nsuch variables is distributed as N(0, √{square root over (n)}) Then, theprobability P(n,z) that the sum of n such variables is equal to orgreater than a value z, is given by the integration of the probabilitydensity function N(0, √{square root over (n)}):

$\begin{matrix}{{P\left( {n,z} \right)} = {\int_{z}^{+ \infty}{\frac{1}{\sqrt{2\pi \; n}}^{\frac{- x^{2}}{2n}}{x}}}} & \left( {{eq}.\mspace{11mu} 58} \right)\end{matrix}$

When applied to aligned fragments of length l having a cumulative scorez, and for which the constituent residues have a normal distributionN(0,1), the probability that the said value z or any greater value isobtained by pure chance is given by equation (59):

$\begin{matrix}{{P\left( {l,z} \right)} = {\int_{z}^{+ \infty}{\frac{1}{\sqrt{2\pi \; l}}^{\frac{- x^{2}}{2l}}{x}}}} & \left( {{eq}.\mspace{11mu} 59} \right)\end{matrix}$

The latter equation allowed to assign a P-value to all high-scoringfragments, which P-value corresponds to the probability that one randompositioning of a fragment of length l results in a score of at least z,which is different from the probability to obtain at least one fragmentof length l with a score of at least z (in the latter case, the size ofthe target sequence and the profile need to be considered). Theassignment of a P-value to each of the 50 highest-scoring fragmentsallowed to rank all fragments according to their significance. Finally,all fragments having a P-value below 10⁻⁴ were retained as potentiallysignificant.

The final global alignment was obtained by first clearing, in theaforementioned SW alignment matrix, all values for matrix elements notforming part of any retained fragment and, secondly, performing aNeedleman-Wunsch (NW) protocol to find the best possible concatenationof the retained fragments (R. Durbin et al., cited supra). The NWprotocol was performed using a gap opening penalty of 2.0 and a gapextension penalty of 0.2, resulting in a unique global alignment. Theassessment of the significance of a global alignment was performed bysumming the cumulative score for all fragments forming part of theglobal alignment, diminished by 2×(the number of fragments−1), furtherdiminished by 0.2×the total number of non-matched residue positionsbetween matched fragments, inputting this value into equation (59) inorder to obtain a global P-value and qualifying the result as a“positively recognized fold” if the global P-value is below 10⁻¹⁵.Alternatively, a fold was also considered as “positively recognized” ifthe best fragment had a P-value below 10⁻¹⁰. This alignment proceduremay not be optimal but that it forms a feasible and practically usefulmethod to illustrate the potential of the ECO concept with respect tofold recognition.

The aforementioned procedure for ECO-based analysis of the structuralcompatibility of a target amino acid sequence with a protein 3Dstructure was applied to a homologous pair of proteins from the C-typelysozyme family. Concretely, the sequence of human alpha-lactalbumin wasaligned with the transformed GECO profile of hen lysozyme (PDB code3LZT). The sequence identity between both proteins is 40% for the bestmatching fragment of 113 residues (out of 123 for lactalbumin and 129for lysozyme), according to a BLAST alignment performed at the internetaddress hftp://www.ncbi.nlm.nih.gov/. As a result of the alignmentprocedure, 10 aligned fragments were retained having a P-value below10⁻⁴ (FIG. 4). The best fragment had a P-value of 8.9×10⁻¹⁵, whichalready qualified the lactalbumin target sequence as being structurallycompatible with the 3LZT fold. Three other fragments (P-values inascending order were 1.5×10⁻¹¹, 6.3×10⁻⁷ and 1.9×10⁻⁵) also formed partof the final global alignment. Since the 3D-structure of humanalpha-lactalbumin is known (PDB code 1B9O), a structural comparison with3LZT was possible. It was found that the NW concatenation proceduresucceeded in selecting only those fragments which are indeed similar instructure. Non-matched regions (between the fragments) were observedtypically in loop regions where one or more deletions occurred withrespect to lysozyme. On the basis of this comparison, the alignment canbe clearly qualified as truly positive. The six fragments not formingpart of the final alignment were dissimilar in structure and werecorrectly excluded by the NW concatenation.

In order to investigate the discriminative power of the method withrespect to non-homologous sequences, the lactalbumin sequence has beenrandomly permuted 1000 times and each sequence was aligned against the3LZT profile. From each alignment, the best scoring fragment with thelowest P-value was retained. Out of these 1000 experiments, the bestrandom fragment had a P-value of 1.4×10⁻⁶ which agrees well with thefalse positive fragments generally appearing in each alignmentexperiment (FIG. 4). It also shows that the false positive fragment inthe lactalbumin/lysozyme alignment having an exceptional P-value of7.3×10⁻⁹ must due to intra-fragment correlation effects, i.e. thehigh-scoring lactalbumin fragment cannot be equivalent to a randomfragment. Since no structural similarity between the target and profilefragments could be observed, the nature of the presumed intra-fragmentcorrelation is as yet not understood. Moreover, portions of thisexceptional fragment also appeared in three other false positivefragments (i.e. fragments 5, 6 and 9 in FIG. 4), although their P-valuewas typical for random fragments, i.e. around 10⁻⁵.

In conclusion, these experiments show that the ECO concept is useful toidentify a structural relationship between two protein molecules byaligning a target amino acid sequence against a GECO profile. Moreover,a global structure-based alignment may be obtained. These results areobtained by only using single ECO's, more specifically the substitutionenergies of single residues in the context of a reference structurewhich may be up to 60% different in amino acid sequence compared to atarget sequence. When suitably including information derived from doubleECO's, the quality of the alignments should expectedly increase, sincedouble ECO's contain information related to pair-wise residue/residueinteractions.

Protein Design Application

The B1 domain of protein G (PDB code 1PGA) was chosen to simulate 100different randomly selected triple-residue substitutions and to comparethe global energy obtained by the effective modeling of eachtriple-substitution with the global energy estimated on the basis ofsingle and double GECO energies. This was not intended to predict themost stable variant but rather to show that protein design applicationsbased on single and double (G)ECO's form a valuable alternative comparedto previously known methods directly operating on a 3D structure,thereby being confronted with the combinatorial substitution problem.

First the GMEC of the 1PGA structure was computed using the FASTERmethod (such as previously described) and the same rotamer library andenergy function as in all other calculations (see Table 1). All residueshaving a rotatable side chain were included in the GMEC computation. Theglobal energy of this structure, which was taken as the referencestructure in all experiments related to 1PGA, was −149.3 kcal mol⁻¹.

Three proximate residue positions being located in the core of 1PGA wereselected to perform further design experiments. More specifically,residues 5 (LEU), 30 (PHE) and 52 (PHE), were chosen to be mutated into100 different random combinations of residue types. Each combination ishereinafter called a mutated sequence and represented as {X,Y,Z} whereX, Y and Z refer to the residue types placed at positions 5, 30 and 52,respectively. The WT “mutation” {LEU,PHE,PHE} was added to this set asan extra test. Each mutated sequence was modeled into 1PGA exactly inthe same way as the WT sequence, after having performed the necessaryside-chain replacements. These modeling experiments were executed by apure structure prediction method (i.e. the FASTER method) and not by anembodiment of the present invention: the conformation of the mutatedresidues was not kept fixed (in contrast to patterns) but was co-modeledwith that of all other rotatable side chains. Analysis of the resultsshowed that the global energies of the mutated sequences ranged from−149.3 to 22274 kcal mol⁻¹. However, all positive values were associatedwith sequences containing at least one PRO residue which is prohibitedin β-sheet (positions 5 and 52) and in an α-helix (position 30). Thehighest energy for sequences not containing PRO was −85.5 kcal mol⁻¹ andwas associated with the sequence {TRP,GLN,VAL}. The sequences with anegative energy clustered around −121.6±9.7 kcal mol⁻¹ (standarddeviation). No sequence had an energy better than the WT sequence, notsurprisingly since only 100 out of the 20³ possible sequences wereexamined and since the WT structure is well-packed.

In order to exemplify the use of the ECO concept with respect to proteindesign, it was investigated how the energies from the modeled sequencescould be approximated by using only single and double GECO energies. Forthis purpose, all possible single and double ECO's for the threeselected positions were generated in agreement with the rotameric SPAembodiment of the present invention. The same experimental settings wereapplied as in the previous fold recognition experiments. ECO's weregrouped into GECO's in accordance with equations (5) and (10) but notransformation was performed on the resulting GECO energies.

From the sets of calculated single residue GECO energies,E_(GECO)(i^(a)), and double residue GECO energies,E_(GECO)([i^(a),j^(b)]), an equivalent set of fuse terms,E_(fuse)(i^(a),j^(b)), was deduced by applying equation (47). The valuesfor all fuse terms related to positions 5 and 30 are listed in Table 4.Most of the fuse terms are negative, which means that effectivelymodeled double mutations are generally better than the mere sum of eachsingle substitution energy. Roughly speaking, negative values areexpected to arise from better interactions in a double substitutioncompared to each single substitution being modeled into the referencestructure, although a double substitution may also be energetically moreconstrained than each of the single mutations involved. In addition,some effects of non-additivity of adaptation may play a role as well.Yet, the fuse terms automatically correct for all these effects so that,together with the single GECO energies, they can be used “backwards” toreconstruct a double mutation without actually needing to model it.

The question of how equation (49) performs in the computation of theglobal energy of multiply substituted proteins was then investigated asfollows. Each of the 101 triplet mutations (100 random sequences plusthe WT sequence) was calculated using equation (49) and these energieswere compared to the energies resulting from the effective modeling in acorrelation diagram (FIG. 5). It was observed that both data setscorrelate surprisingly well, given the numerous possible sources ofapproximation. Out of the 101 sequences, 55 had a difference betweenmodeled and calculated energy less than 1 kcal mol⁻¹ in absolute valueand 25 more had a difference less than 5 kcal mol⁻¹. The presentinvention includes difference of less than 10 kcal mol⁻¹. Moreover, inthe most interesting range, i.e. for low-energy sequences, thecorrelation was even better. In the range up to 20 kcal mol⁻above theenergy of the WT sequence all 15 sequences differed in energy by lessthan 1 kcal mol⁻¹ in absolute value. Some larger differences wereobserved as well, but they were found exclusively in regions of higherenergy. Even for the most constrained sequences containing at least onePRO (15 sequences; results not shown in FIG. 5), 14 had a difference inenergy below 10 kcal mol⁻, while the energy of these modeled sequenceswas always higher than 400 kcal mol⁻¹. While most energy valuescalculated from single GECO energies and fuse terms were roughly evenlydistributed above and below the values from the modeled sequences, thelargest differences were always positive, which may be due tonon-additivity in the adaptations upon calculation of the double residueECO's (leading to an overestimation of the fuse terms). For comparison,the energy of each sequence calculated as the sum of each involvedsingle GECO energy (ignoring the fuse terms) was plotted in FIG. 5 aswell. Even here some correlation with the effectively modeled energiescan be observed but the percentual difference is significantly higher,which illustrates that the fuse terms contribute to the global energy ina most beneficial way.

In conclusion, the estimation of the global energy of multiplysubstituted proteins may be performed to sufficient accuracy by usinginformation derived from only single and double residue ECO's at leastin the case studied, i.e. the core of the B1 domain of protein G. It iswell-known in the art that core substitutions are among the mostdifficult ones since delicate packing effects are involved. Therefore,the present invention allows the design of new sequences, irrespectiveof the region where mutations are proposed. In addition, it is worthnoting that the present embodiment to derive global energies from singleand double residue GECO's is extremely fast when the latter values havebeen pre-calculated and can be retrieved from a file or database.Therefore, the present method may be useful to rapidly search a reducedset of structure-compatible sequences out of a combinatorial number ofpossibilities. Sequences obtained this way may then be submitted to moredetailed analysis.

The present invention may be implemented on a computing device e.g. apersonal computer or a work station which has an input device forloading the details of the reference structure and the reference aminoacid sequence defined in step A of claim 1, and also any further aminoacid sequences. The computing device is adapted to run software whichcarries out any of the methods in accordance with the present invention.The computer may be a server which is connected to a data communicationstransmission means such as the Internet. A script file including thedetails of the reference structure and the amino acid sequence may besent from one near location, e.g. terminal, to a remote, i.e. secondlocation, at which the server resides. The server receives this data andoutputs back along the communications line useful data to the nearterminal, e.g. the alignment of a sequence relative to the referencestructure, or a set of favourable amino acid sequences for one or moreresidues in the reference structure, a protein where its sequence or apart thereof is determined by one or more of the claimed methods, adatabase or part of a database including sets of ECO's for one or moreproteins; an ECO in the form of data structure.

FIG. 6 is a schematic representation of a computing system which can beutilized in accordance with the method and system of the presentinvention. The system and method provided by the present invention canbe implemented with the computing system depicted in FIG. 6. A computer10 is depicted which may include a video display terminal 14, a datainput means such as a keyboard 16, and a graphic user interfaceindicating means such as a mouse 18. Computer 10 may be implemented as ageneral purpose computer.

Computer 10 includes a Central Processing Unit (“CPU”) 15, such as aconventional microprocessor of which a Pentium III processor supplied byIntel Corp. USA is only an example, and a number of other unitsinterconnected via system bus 22. The computer 10 includes at least onememory. Memory may include any of a variety of data storage devicesknown to the skilled person such as random-access memory (“RAM”),read-only memory (“ROM”), non-volatile read/write memory such as a harddisc as known to the skilled person. For example, computer 10 mayfurther include random-access memory (“RAM”) 24, read-only memory(“ROM”) 26, as well as an optional display adapter 27 for connectingsystem bus 22 to an optional video display terminal 14, and an optionalinput/output (110) adapter 29 for connecting peripheral devices (e.g.,disk and tape drives 23) to system bus 22. Video display terminal 14 canbe the visual output of computer 10, which can be any suitable displaydevice such as a CRT-based video display well-known in the art ofcomputer hardware. However, with a portable or notebook-based computer,video display terminal 14 can be replaced with a LCD-based or a gasplasma-based flat-panel display. Computer 10 further includes userinterface adapter 30 for connecting a keyboard 16, mouse 18, optionalspeaker 36, as well as allowing optional physical value inputs fromphysical value capture devices 40 of an external system 20. The devices40 may be any suitable equipment for capturing physical parametersrequired in the execution of the present invention. Additional oralternative devices 41 for capturing physical parameters of anadditional or alternative external system 21 may also connected to bus22 via a communication adapter 39 connecting computer 10 to a datanetwork such as the Internet, an Intranet a Local or Wide Area network(LAN or WAN) or a CAN. The term “physical value capture device” includesdevices which provide values of parameters of a system, e.g. they may bean information network or databases such as a rotamer library.

Computer 10 also includes a graphical user interface that resides withina machine-readable media to direct the operation of computer 10. Anysuitable machine-readable media may retain the graphical user interface,such as a random access memory (RAM) 24, a read-only memory (ROM) 26, amagnetic diskette, magnetic tape, or optical disk (the last three beinglocated in disk and tape drives 23). Any suitable operating system andassociated graphical user interface (e.g., Microsoft Windows) may directCPU 15. In addition, computer 10 includes a control program 51 whichresides within computer memory storage 52. Control program 51 containsinstructions that when executed on CPU 15 carries out the operationsdescribed with respect to the methods of the present invention asdescribed above.

Those skilled in the art will appreciate that the hardware representedin FIG. 6 may vary for specific applications. For example, otherperipheral devices such as optical disk media, audio adapters, or chipprogramming devices, such as PAL or EPROM programming devices well-knownin the art of computer hardware, and the like may be utilized inaddition to or in place of the hardware already described.

In the example depicted in FIG. 6, the computer program product (i.e.control program 51 for executing methods in accordance with the presentinvention comprising instruction means in accordance with the presentinvention) can reside in computer storage 52. However, it is importantthat while the present invention has been, and will continue to be, thatthose skilled in the art will appreciate that the methods of the presentinvention are capable of being distributed as a program product in avariety of forms, and that the present invention applies equallyregardless of the particular type of signal bearing media used toactually carry out the distribution. Examples of computer readablesignal bearing media include: recordable type media such as floppy disksand CD ROMs and transmission type media such as digital and analoguecommunication links.

TABLE 1 I II III IV V VI VII VIII IX X XI XII XIII 3LZT SD200 1 allHFlex WT: ALA Faster 3LZT_G 32 GLY 1 Faster 1.944 3LZT SD200 1 allH FlexWT: ALA Faster 3LZT_G 32 ALA 1 Faster 0 3LZT SD200 1 allH Flex WT: ALAFaster 3LZT_G 32 PRO 1 Faster 670.4 3LZT SD200 1 allH Flex WT: ALAFaster 3LZT_G 32 PRO 2 Faster 2386 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 VAL 1 Faster 32.75 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 VAL 2 Faster 55.04 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 VAL 3 Faster 128.3 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 VAL 4 Faster 85.35 . . . 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 SER 1 Faster 5.792 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 SER 2 Faster 6.844 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 SER 3 Faster 5.293 . . . 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 ILE 1 Faster 165.5 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 ILE 2 Faster 253.2 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 ILE 3 Faster 232.3 . . . 3LZT SD200 1 allH Flex WT: ALA Faster3LZT_G 32 LEU 1 Faster 358 3LZT SD200 1 allH Flex WT: ALA Faster 3LZT_G32 LEU 2 Faster 732.4 3LZT SD200 1 allH Flex WT: ALA Faster 3LZT_G 32LEU 3 Faster 521.4 . . . 3LZT SD200 1 allH Flex WT: ALA Faster 3LZT_G 32PHE 1 Faster 908 3LZT SD200 1 allH Flex WT: ALA Faster 3LZT_G 32 PHE 2Faster 1656 3LZT SD200 1 allH Flex WT: ALA Faster 3LZT_G 32 PHE 3 Faster2137 . . . 3LZT SD200 1 allH Flex WT: ILE Faster 3LZT_G 55 GLY 1 Faster4.34 3LZT SD200 1 allH Flex WT: ILE Faster 3LZT_G 55 ALA 1 Faster 2.7833LZT SD200 1 allH Flex WT: ILE Faster 3LZT_G 55 PRO 1 Faster 7.783 3LZTSD200 1 allH Flex WT: ILE Faster 3LZT_G 55 PRO 2 Faster 23.26 3LZT SD2001 allH Flex WT: ILE Faster 3LZT_G 55 VAL 1 Faster 1530 3LZT SD200 1 allHFlex WT: ILE Faster 3LZT_G 55 VAL 2 Faster 518.5 3LZT SD200 1 allH FlexWT: ILE Faster 3LZT_G 55 VAL 3 Faster 80.25 . . . 3LZT SD200 1 allH FlexWT: ILE Faster 3LZT_G 55 SER 1 Faster 5.179 3LZT SD200 1 allH Flex WT:ILE Faster 3LZT_G 55 SER 2 Faster 4.224 3LZT SD200 1 allH Flex WT: ILEFaster 3LZT_G 55 SER 3 Faster 5.869 . . . 3LZT SD200 1 allH Flex WT: ILEFaster 3LZT_G 55 ILE 9 Faster 84.37 3LZT SD200 1 allH Flex WT: ILEFaster 3LZT_G 55 ILE 10 Faster 0 3LZT SD200 1 allH Flex WT: ILE Faster3LZT_G 55 ILE 11 Faster 0.4476 . . . 3LZT SD200 1 allH Rex WT: ILEFaster 3LZT_G 55 LEU 12 Faster 189.5 3LZT SD200 1 allH Flex WT: ILEFaster 3LZT_G 55 LEU 13 Faster 5.599 3LZT SD200 1 allH Flex WT: ILEFaster 3LZT_G 55 LEU 14 Faster 145.4 . . . 3LZT SD200 1 allH Flex WT:ILE Faster 3LZT_G 55 PHE 1 Faster 9.99E+05 3LZT SD200 1 allH Flex WT:ILE Faster 3LZT_G 55 PHE 2 Faster 5.88E+05 3LZT SD200 1 allH Flex WT:ILE Faster 3LZT_G 55 PHE 3 Faster 3.39E+08 . . .TABLE 1 shows a compilation of single residue ECO's generated by thepresent method. Each line represents an ECO. The columns are labeledaccording to the list described in the section FORMAL DESCRIPTION OFECO'S. Only basic ECO properties are shown. 3LZT is the PDB code for henlysozyme; SD200: 200 steps steepest descent energy minimization; all H:CHARMM force field of Brooks et al. (cited supra), including explicithydrogen atoms and TTS contributions as listed in Table 3; Flex: all“flexible” residues having a rotatable side chain; WT:res: WT residue“res” as observed in 3LZT; 3LZT_G: global minimum energy conformationfor 3LZT. The value “1” in column III refers to the rotamer library ofDe Maeyer et al. (cited supra).

TABLE 2 pos WT GLY ALA PRO VAL SER CYS THR ILE LEU ASP 1 LYS 5.393 4.80872.34 5.448 3.745 100 3.85 7.888 3.487 4.417 2 VAL 3.281 1.826 −1.2190.000 2.181 100 1.597 0.4949 2.905 −1.593 3 PHE 10.12 8.73 100 70.459.215 100 18.02 75.28 6.772 12.07 4 GLY 0.000 −0.6605 188.4 −1.841−3.037 100 −2.98 −1.451 −1.385 −2.933 5 ARG −2.052 −2.955 8.138 −3.549−4.842 100 −4.443 −2.272 −0.7667 −5.503 6 CYS 100 100 100 100 100 0.000100 100 100 100 7 GLU 5.048 4.898 402.3 2.335 2.027 100 2.672 2.61 3.1112.117 8 LEU 5.793 4.472 1220 12.26 3.764 100 26.02 41.45 0.000 12.49 9ALA 0.8005 0.000 2072 62.55 1.565 100 16.86 136.7 97.43 39.37 10 ALA3.083 0.000 364.5 6.618 1.906 100 1.503 9.388 0.1485 −3.284 11 ALA 1.7820.000 541.4 9.069 0.057 100 2.62 9.38 5.411 5.746 12 MET 2.882 6.678878.6 38.15 4.784 100 8.915 36.87 11.29 8.075 13 LYS 6.852 2.989 792.17.331 5.694 100 5.703 40.98 6.538 5.413 14 ARG −2.063 −4.292 1773 −1.594−3.028 100 −4.176 −2.385 −3.231 1.051 15 HIS −6.665 −6.38 1686 6.251−5.89 100 −6.074 29.58 1.38 −3.371 16 GLY 0.000 5.167 3.25E+12 67.439.153 100 27.81 29.08 7.546 8.569 17 LEU 8.529 6.551 1884 17.76 6.421100 4.534 53.78 0.000 7.65 18 ASP 1.38 0.9228 16.84 5.34 −1.209 100−1.72 13.29 6.838 0.000 19 ASN −2.948 −1.913 9.40E+08 7.344 −1.198 1001.66 4.593 −0.8527 0.1945 20 TYR 8.395 6.114 9070 17.52 6.637 100 20.0617.99 7.139 6.815 21 ARG −1.621 −3.643 1.80E+10 18.34 −3.093 100 3.7914.843 4.673 −2.58 22 GLY 0.000 8.119 7.88E+11 26.17 7.271 100 13.9818.86 13.91 8.322 23 TYR 10.12 6.976 2214 5.021 7.194 100 9.943 6.50611.16 8.951 24 SER 1.098 −0.7391 15.88 0.9515 0.000 100 0.7865 0.1856−0.5185 −0.9154 25 LEU 1.436 5.448 72.67 18.89 7.257 100 9.459 16.310.000 10.48 26 GLY 0.000 413.2 72.8 2906 488.8 100 6.72E+04 7422 3759576.5 27 ASN 3.262 1.816 5519 5.026 −0.2109 100 −0.161 7.191 8.994−0.5562 28 TRP 17.26 16.29 1.11E+04 38.91 16.8 100 20.82 42.03 15.4418.27 29 VAL 6.676 5.784 2143 0.000 4.114 100 2.796 6.174 21.26 5.064 30CYS 100 100 100 100 100 0.000 100 100 100 100 31 ALA 1.829 0.000 147658.55 0.5887 100 −1.06 145.7 122.5 9.451 32 ALA 1.944 0.000 670.4 32.753.042 100 25.14 156.9 290.2 58.79 33 LYS 0.3971 −0.4244 2413 30.82 0.060100 4.057 39.72 2.541 −2.799 34 PHE 5.282 4.738 1496 10.49 5.211 1003.614 12.01 8.443 3.627 35 GLU −4.431 −6.218 800.3 21.72 −8.055 100−4.839 129.9 −0.279 17.6 36 SER 3.591 0.926 2.25E+04 57.35 0.000 10025.42 400.6 6537 141.8 37 ASN 1.402 −0.4658 4.35E+08 11.49 0.4319 1006.368 4.874 −0.1753 0.7699 38 PHE 8.244 6.079 5.12E+09 339.2 9.921 10020.63 177.4 5.658 4.183 39 ASN 4.559 2.964 1766 7.849 2.35 100 3.3133.054 5.994 −1.255 40 THR 0.9648 −1.342 1215 14.77 0.8346 100 0.000 28.27.831 5.793 41 GLN 0.5118 −0.599 13.37 2.208 −2.107 100 −3.109 2.815−0.2787 −0.1364 42 ALA 1.887 0.000 190.5 21.59 0.5344 100 4.219 23.3435.57 5.14 43 THR 2.643 −0.4468 13.4 −1.936 0.6394 100 0.000 −2.14 12.754.164 44 ASN 4.585 2.837 3.40E+04 3.555 5.712 100 6.253 2.771 2.1815.291 45 ARG −1.964 −3.796 −6.478 −4.096 −2.206 100 −2.708 −4.016 −1.927−4.39 46 ASN 5.849 5.375 1.72E+04 7.695 7.83 100 9.095 8.181 4.493 3.6747 THR −0.1194 −1.357 8.153 −1.179 −0.2641 100 0.000 −0.4753 −0.4937−0.3787 48 ASP 3.072 1.801 50.67 0.8852 −0.1421 100 0.016 3.45 1.8370.000 49 GLY 0.000 11.01 5.41E+09 226.4 21.75 100 83.87 81.68 20.6226.97 50 SER −2.755 −2.275 78.36 21.58 0.000 100 3.124 40.33 107.7 1.318pos ASN PHE TYR TRP MET GLU GLN LYS ARG HIS 1 5.698 3.197 4.647 5.2951.59 2.688 1.885 0.000 3.654 2.87 2 0.4249 4.991 5.04 7.449 1.689 2.92.913 0.4943 2.241 5.063 3 10.46 0.000 −1.361 3.071 6.441 12.9 10.4912.68 7.076 4.701 4 0.4594 −0.8102 −0.1396 2.266 −0.1753 −4.024 0.8653−0.2155 3.071 −0.050 5 −5.619 7.291 5.133 9.747 −3.488 −5.483 −2.816−2.112 0.000 −2.046 6 100 100 100 100 100 100 100 100 100 100 7 7.2276.817 7.035 11.62 4.23 0.000 1.512 7.179 9.046 6.931 8 10.08 42.79 48.99182.1 20.34 9.464 6.824 70.65 104.2 43.09 9 33.74 173.3 188.3 208.932.29 99.8 51.03 104.8 119.7 67.77 10 0.062 2.425 2.267 3.813 0.6235−0.5627 −0.5718 1.679 5.199 4.627 11 4.65 0.8659 0.5695 5.324 0.69883.71 6.77 3.079 0.6966 3.72 12 6.182 2.992 3.006 104.3 0.000 3.997 1.49125.42 7.127 6.104 13 5.318 6.165 7.002 8.353 5.877 8.446 7.051 0.0008.214 6.398 14 −1.702 −0.8066 −2.032 3.512 −1.955 −2.235 −2.782 −2.8590.000 0.2366 15 −5.431 −0.3028 0.4972 4.455 −6.559 −7.577 −7.105 −6.646−4.206 0.000 16 10.58 5.203 5.124 10.2 10.53 12.23 6.564 5.826 13.767.48 17 4.853 24.15 163.4 133.5 10.07 7.276 8.038 14.88 77.52 6.509 181.235 3.968 3.539 6.153 1.621 −4.022 0.406 3.11 8.335 2.084 19 0.000−3.388 −2.707 0.2491 0.046 3.029 1.476 −2.997 5.017 −0.042 20 5.69−0.249 0.000 5.22 6.818 5.681 6.36 11.83 9.342 7.475 21 −0.8562 −2.193−1.52 1.711 −3.325 0.5204 −0.466 −5.685 0.000 −1.446 22 12.36 6.307 7.3910.98 16.81 17.57 11.47 11.83 20.12 9.072 23 8.435 1.542 0.000 5.19317.71 8.649 7.118 24.57 12.92 9.825 24 0.1757 1.455 1.732 2.719 −0.095−8.693 −6.357 0.085 −0.2637 0.3541 25 7.968 130.8 123.6 99.29 4.8153.108 2.329 34.93 53.48 27.52 26 532 677.8 749.8 277.9 474.8 644.7 15491207 1010 173.6 27 0.000 1.823 2.121 12.23 2.069 0.6558 4.712 0.86096.779 0.090 28 17.27 8.01 5.768 0.000 12.65 21.79 15.79 18.92 10.3 6.54629 5.974 850.3 756.6 498.5 80.01 39.03 28.86 116.8 247.4 81.64 30 100100 100 100 100 100 100 100 100 100 31 5.958 519.2 385.8 1788 53.1446.33 22.31 99.98 76.54 59.34 32 41.19 157.2 201.5 329.6 134.4 237 114.1165.5 212.8 83.42 33 −1.312 −3.276 −2.54 4.067 −1.984 −3.427 −0.41080.000 −0.6887 0.876 34 4.259 0.000 −1.287 5.524 5.494 6.074 6.232 9.1689.371 5.287 35 5.857 107.9 1018 185.4 41.73 0.000 14.69 55.5 133.7 37.6236 99.41 2236 4807 1910 514.1 313 157.4 417.6 5701 910.8 37 0.000 1.6661.261 4.068 0.3995 1.602 2.154 0.6332 −0.6882 4.073 38 1.996 0.000 2.873.547 20.44 22.42 10.46 11.78 17.07 7.017 39 0.000 6.132 6.023 15.133.445 1.299 4.21 4.822 5.066 2.931 40 0.4273 48.97 48.05 145.2 16.3810.83 1.687 24.94 53.96 21 41 −1.165 −1.053 −1.419 2.519 1.174 −1.20.000 0.1047 2.685 −0.263 42 10 6.085 5.325 7.67 0.5431 3.6 5.07 7.775.485 3.054 43 4.081 8.137 8.103 10.83 0.2002 −0.5656 0.3179 1.624 1.6248.908 44 0.000 5.252 4.929 8.913 4.438 4.503 0.8133 −1.346 1.907 5.21945 −5.778 −3.135 −3.884 −1.394 −1.412 −2.193 −3.218 −3.242 0.000 −1.31146 0.000 11.71 10.2 9.011 5.677 5.838 4.908 2.689 7.508 5.452 47 0.51251.691 1.789 5.213 1.474 −0.1983 0.3976 0.7072 1.16 2.168 48 4.053 4.8184.787 8.861 3.455 −1.836 0.033 4.793 7.367 4.381 49 22.52 20.65 20.1822.09 18.71 19.81 15.52 12.29 22.3 16.47 50 6.679 56.61 50.87 58.91−0.1183 4.727 0.7738 3.184 26.9 22.02TABLE 2 shows a compilation of single residue GECO energies generated bythe present method. GECO energies are shown for the first 50 residues ofthe protein structure referred to by the PDB code 3LZT. The GECOenergies shown have been generated from the data in a database of ECO'saccording to TABLE 1, using the min-operator over all rotamers for thesame residue (eq. (10)). The energies are expressed in kcal mol⁻¹. Thefirst column shows the residue position in 3LZT at which all naturallyoccurring amino acid residues (indicated in the first row) have beenmodeled. The second column shows the WT residue at each position. It canbe seen that all GECO energies for WT substitutions are equal to zerosince the reference structure has been calculated on the basis of the WTamino acid sequence. All values for CYS substitutions are arbitrary setto 100 kcal mol except when the WT sequence has a CYS residue at thatposition, in which case the value is set to 0 kcal mol⁻¹.

TABLE 3 GLY ALA PRO VAL SER THR ILE LEU ASP ASN (semi-)buried −0.368−1.786 −22.240 −4.562 1.315 2.224 −5.639 −4.755 10.852 10.106(semi-)exposed −1.508 −1.384 −19.094 −1.964 −0.039 1.748 −2.682 −1.6083.115 7.069 PHE TYR TRP MET GLU GLN LYS ARG HIS (semi-)buried 4.1617.087 7.349 −1.015 7.638 7.158 1.777 7.542 8.864 (semi-)exposed 5.5887.442 9.129 −0.211 1.020 4.730 −1.204 7.592 6.468TABLE 3 shows a set of TTS terms, obtained by the method described inthe section ENERGY FUNCTION, for 19 naturally occurring amino acidresidue types and two classes of solvent exposure. Cys reside types havenot been included in the TTS parameter optimization procedure.

The solvent exposure classes are defined as follows: (semi-) buried:ASA% <25%; (semi-)exposed: 25% 5_(—) ASA%. All values are expressed inkcal mol⁻¹. The computations were performed on 22 protein structures(PDB codes: 3LZT, 1A3C, 1A62, 1AAC, IAH7, IAHO, 1 AIE, 1AMM, 1AVM, 1AWD,1BRT, 1BTK, 10EX, 1CVL, lEDG, IEDM, 1FMK, 1G3P, 1GVP, TIRO, 1IXH, 1KQH).The rotamer,library of De Maeyer et al. (cited supra) and the CHARMMforce field of Brooks et al. (cited supra) with explicit hydrogen atomswere used.

TABLE 4 5\30 GLY ALA PRO VAL SER CYS THR ILE LEU ASP ASN GLY 0.18 −0.180.28 −5.96 0.02 0.54 0.64 −15.91 −6.79 1.71 1.10 ALA −0.93 −1.02 −0.66−7.17 −1.35 −0.49 −0.79 −17.06 −5.77 −0.45 −0.54 PRO −3.83 −3.88 −3.62−10.28 −4.10 −3.52 −3.59 −20.22 −9.17 −3.49 −3.57 VAL −7.71 −8.23 −7.80−6.05 −8.16 0.11 −1.42 −16.01 −5.36 0.58 0.29 SER −0.47 −0.95 −0.47−6.78 −0.55 −0.38 0.00 −16.78 −5.71 0.68 −0.07 CYS −0.06 −0.27 0.09−6.23 −0.56 0.01 −0.10 −16.05 −5.56 −0.10 −0.14 THR −6.99 −7.76 −7.23−12.58 −7.19 −5.47 −5.97 −20.73 −5.39 −1.76 −2.92 ILE −4.98 −4.29 −3.160.19 −4.53 0.30 0.42 0.06 −5.25 0.56 0.45 LEU 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 0.00 0.00 0.00 ASP 1.02 0.12 0.58 −2.05 1.06 0.23 1.49−11.76 −5.20 1.69 0.26 ASN 0.43 0.37 0.61 −5.42 −0.35 0.53 −0.04 −15.04−5.17 0.07 0.37 PHE −5.75 −3.24 −3.00 −3.29 −3.87 0.93 1.98 −7.58 −4.861.47 −1.17 TYR −5.30 −3.57 −3.23 −3.73 −4.46 0.58 6.71 99.82 93.70 7.92−0.92 TRP 3.21 2.99 4.07 −0.92 −0.53 2.89 3.32 −8.50 −3.51 2.81 2.56 MET1.01 0.43 0.78 0.64 0.57 0.49 0.82 0.31 −3.84 1.07 0.83 GLU −3.53 −4.17−3.83 −1.17 −3.90 −1.84 −0.61 −5.25 2.48 −0.84 −0.33 GLN 1.22 0.27 0.931.14 1.27 0.54 2.02 −7.65 −4.90 1.61 −0.26 LYS −3.74 −1.13 −1.44 1.19−2.79 0.87 −0.87 −3.52 −2.23 −1.04 0.52 ARG −1.17 0.15 −0.47 −0.10 −0.74−0.09 −1.09 −4.83 0.01 −1.10 −0.02 HIS −4.32 −3.45 −3.07 −2.90 −3.381.00 2.70 −10.47 −3.84 2.91 −1.96 5\30 PHE TYR TRP MET GLU GLN LYS ARGHIS GLY 0.00 −3.70 −1.48 −3.80 −8.37 −0.87 −6.27 −13.87 1.42 ALA 0.00−2.91 −2.80 −1.66 −9.34 −1.69 −2.23 −7.59 −0.01 PRO 0.00 0.71 20.15−2.75 −11.76 −4.97 −4.69 −10.38 −3.07 VAL 0.00 −1.42 −0.50 −1.11 −6.94−1.30 −1.91 −9.65 −4.21 SER 0.00 −2.79 −2.88 −1.03 −9.20 −1.50 −3.33−5.63 0.26 CYS 0.00 −3.24 −3.34 −1.64 −7.38 −1.34 −1.90 −1.57 0.27 THR0.00 −1.41 0.24 −1.57 −8.43 −2.58 −2.47 −9.15 −5.12 ILE 0.00 −0.78 2.73−1.10 −4.99 −1.34 −1.37 −5.01 0.68 LEU 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 ASP 0.00 −2.95 −3.25 −0.34 −4.83 −1.07 −3.69 −2.37 0.82ASN 0.00 −3.15 −2.80 −0.80 −7.04 −2.50 −1.50 −1.05 1.05 PHE 0.00 0.35−2.20 −1.81 −6.25 −1.91 −1.94 −3.80 0.61 TYR 0.00 2.99 25.24 −2.02 −4.220.62 129.25 −4.77 −0.20 TRP 0.00 −2.88 −2.22 −2.03 −4.03 −0.29 −2.67−1.54 0.99 MET 0.00 −1.85 0.08 −0.24 −4.91 −0.63 −2.14 −0.33 0.64 GLU0.00 −3.26 −2.17 −0.23 −7.02 −3.81 −5.30 −5.41 −0.45 GLN 0.00 −3.41−2.75 −1.38 −3.57 0.89 −2.49 −1.83 0.61 LYS 0.00 −0.48 −0.45 −0.92 −7.34−2.04 −1.93 −1.39 −0.83 ARG 0.00 0.63 −0.40 −1.47 −8.28 −3.35 −2.11−5.12 −0.18 HIS 0.00 0.37 −1.35 −1.67 −8.63 −1.71 −0.78 −6.70 1.24TABLE 4 shows the fuse terms for all possible double mutations atresidue positions 5 and 30 of the B1 domain of protein G (PDB codeIPGA), derived in accordance with eq. (47). Values are expressed in kcalmol⁻¹.

1. A method for analyzing a protein structure, the method being executedin a computer under the control of a program stored in the computer, andcomprising the following steps: A) receiving, by the computer, areference structure for a protein, whereby the said reference structureforms a representation of a three-dimensional structure of the saidprotein, and has a global energy E_(ref), the protein consists of aplurality of residue positions, each carrying a particular referenceamino acid type in a specific reference conformation, and the proteinresidues are classified into a set of modeled residue positions and aset of conformationally fixed residues; B) substituting, by thecomputer, into the reference structure of step (A) a pattern, wherebythe said pattern consists of one or more of the modeled residuepositions defined in step (A), each carrying a particular amino acidresidue type being placed at a specific residue position in thereference structure and adopting a specific conformation, and ’the oneor more amino acid residue types of the said pattern are replacing thecorresponding amino acid residue types present in the referencestructure; C) optimizing, by the computer, the conformation of thereference structure of step (A) being substituted by the pattern of step(B), whereby a suitable protein structure optimization method based on afunction allowing to assess the quality of a global protein structure,or any part thereof, is used in combination with a suitableconformational search method, and the said structure optimization methodis applied to all modeled residue positions defined in step (A) notbeing located at any of the pattern residue positions defined in step(B), with the proviso that said structure optimization method is notapplied to any of said pattern residue positions; D) assessing, by thecomputer, the energetic compatibility of the pattern defined in step (B)within the context of the reference structure defined in step (A) beingstructurally optimized in step (C) with respect to the said pattern, byway of comparing the global energy E(p) of the substituted and optimizedprotein structure with the global energy E_(ref) of the non-substitutedreference structure to obtain an energetic compatibility object energyE_(ECO)(p); and E) generating and outputting, by the computer, to adisplay, a user, a readily accessible memory, another computer on anetwork, or another computer-readable medium, in the form of anenergetic compatibility object (ECO), a value reflecting the saidenergetic compatibility of the pattern together with information forminga full description of the three-dimensional structure comprising thesubstituted pattern; wherein the reference structure of step (A) isreceived from a database of protein structures. 20
 2. A method accordingto claim 1, wherein the reference structure of step (A) furtherrepresents the three-dimensional structure of ligands includingco-factors, ions or water molecules.
 3. A method according to claim 1,wherein the reference structure of step (A) has undergone a furtherenergy optimization step by means of a molecular mechanics gradientmethod.
 4. A method according to claim 1, wherein the referencestructure of step (A) assumes a reference amino acid sequence being thewild type sequence of the said protein.
 5. A method according to claim1, wherein the reference structure of step (A) assumes a reference aminoacid sequence being the sequence of lowest energy for the said protein.6. A method according to claim 1, wherein the reference structure ofstep (A) assumes a reference amino acid sequence being a polyglycine ora polyalanine sequence.
 7. A method according to claim 1, wherein thereference structure of step (A) assumes any possible reference aminoacid sequence including a random or a dummy sequence.
 8. A methodaccording to claim 1, wherein the reference structure of step (A) hasbeen modeled by the structure optimization method used in step (C).
 9. Amethod according to claim 1, wherein the set of modeled residuepositions comprises at least one position.
 10. A method according toclaim 1, wherein the set of modeled residue positions comprises at leastthree positions.
 11. A method according to claim 1, wherein the set ofmodeled residue positions comprises at least five positions.
 12. Amethod according to claim 1, wherein the pattern of step (B) consists ofa single residue position, the said pattern being referred to as asingle residue pattern.
 13. A method according to claim 1, wherein thepattern of step (B) consists of a pair of residue positions, the saidpattern being referred to as a single residue pattern.
 14. A methodaccording to claim 1, wherein each of the pattern residue positionsdefined in step (B) are assigned a naturally occurring amino acidresidue type.
 15. A method according to claim 1, wherein each of thepattern residue positions defined in step (B) are assigned a naturallyor a non-naturally occurring amino acid residue type.
 16. A methodaccording to claim 1, wherein each of the pattern residue positionscarrying a particular amino acid residue type as defined in step (B) areassigned a fixed conformation retrieved from a protein structuredatabase.
 17. A method according to claim 1, wherein each of the patternresidue positions carrying a particular amino acid residue type asdefined in step (B) are assigned a fixed conformation which iscomputer-generated.
 18. A method according to claim 1, wherein each ofthe pattern residue positions carrying a particular amino acid residuetype as defined in step (B) are assigned a fixed conformation which isreceived from a rotamer library, the said pattern being referred to as arotameric pattern, and wherein the said rotamer library isbackbone-dependent, the said library being referred to as a combinedmain-chain/side-chain rotamer library, but wherein the main-chaininformation in the library is only used in order to assign a side-chainconformation to each of the pattern residue positions carrying aparticular amino acid residue type as defined in step (B) which isstructurally compatible with the local main-chain conformation of eachcorresponding residue position in the reference structure of step (A).19. A method according to claim 1, wherein the said conformationalsearch method of step (C) includes a Dead End Elimination (DEE) method.20. A method according to claim 1, wherein the said conformationalsearch method of step (C) includes a Monte Carlo simulation method. 21.A method according to claim 1, wherein the said conformational searchmethod of step (C) includes a genetic algorithm method.
 22. A methodaccording to claim 1, wherein the said conformational search method ofstep (C) includes a mean-field method.
 23. A method according to claim1, wherein the said conformational search method of step (C) includes amolecular mechanics gradient method such as a steepest descent- or aconjugated gradient energy minimization method.
 24. A method accordingto claim 23, wherein the said template residues are not kept fixed butare allowed to vary by at most 2 angströms in root-mean-square deviationcompared to the reference structure of step (A).
 25. A method accordingto claim 23, wherein neither the said pattern residues nor the saidtemplate residues are kept fixed but wherein they are allowed to vary byat most 2 angströms in root-mean-square deviation compared to thesubstituted reference structure of step (B).
 26. A method according toclaim 1, wherein the residue positions to which the structureoptimization is applied as defined in step (C) are rotameric, in thatthey are allowed to assume conformations received from a rotamer libraryin the same way as occurs for rotameric pattern residues.
 27. A methodaccording to claim 1, wherein the said energetic compatibility of thepattern is calculated as the difference in global energy between thesubstituted and optimized protein structure of step (C) and thereference structure of step (A).
 28. A method according to claim 1,wherein the said energetic compatibility of the pattern is calculated asthe absolute global energy of the substituted and optimized proteinstructure of step (C), this being equivalent to selecting in step (A) areference amino acid sequence such as a polyglycine sequence or a dummysequence.
 29. A method according to claim 1, wherein the self-energy ofthe template is ignored.
 30. A method according to claim 1, wherein theself-energy of the protein backbone is ignored.
 31. A method accordingto claim 1, wherein the ECO formed in step (E) further includes adescription of the pattern in terms of the number of pattern residuepositions, the location of each pattern residue position within theprotein structure, the selection of an amino acid residue type at eachposition and the conformation of each pattern residue.
 32. A methodaccording to claim 1, wherein the ECO further includes a description ofor a reference to the protein structure received in step (A).
 33. Amethod according to claims 1, wherein the ECO further includes adescription of or a reference to the modeled residue positions definedin step (A).
 34. A method according to claim 1, wherein the ECO furtherincludes a description of or a reference to the amino acid referencesequence defined in step (A).
 35. A method according to claim 1, whereinthe ECO further includes a description of or a reference to thestructure optimization method used in step (C).
 36. A method accordingto claim 1, wherein the ECO further includes a description of or areference to the rotamer library used to assign the conformation of thepattern residues in accordance with step (B) and/or to define theallowed conformational states of the modeled, non-pattern residues inaccordance with step (C).
 37. A method according to claim 1, wherein theECO further includes a description of or a reference to the costfunction used in step (C).
 38. A method according to claim 1, whereinthe ECO further includes a description of or a reference to thesubstituted and optimized protein structure obtained in step (C).
 39. Amethod according to claim 1, wherein the ECO further includes additionalinformation derived from either the reference structure of step (A) orthe substituted and optimized structure of step (C), such as the localsecondary structure of each modeled residue and/or its accessiblesurface area.
 40. A method according to claim 1, wherein steps (B) to(D) are repeated for different patterns defined in step (B).
 41. Amethod according to claim 40, wherein the patterns are single residuepatterns located at one single residue position selected from the set ofmodeled residue positions of step (A) and are formed by systematicallyconsidering all possible combinations of amino acid residue types indifferent conformations, and wherein the amino acid residue types areselected from a pre-defined list of residue types and the saidconformations are selected from a pre-defined list of conformations. 42.A method according to claim 40, wherein the patterns are double residuepatterns located at two different single residue position selected fromthe set of modeled residue positions of step (A) and are formed bysystematically considering, for the pair of selected residue positions,all possible combinations of amino acid residue types in differentconformations and wherein the amino acid residue types are selected froma pre-defined list of residue types and the said conformations areselected from a pre-defined list of conformations. 25
 43. A methodaccording to claim 41, being repeated until each single residue positionof the set of modeled residue positions of step (A) has been selected asa pattern residue position.
 44. A method according to claim 42, beingrepeated until each possible pair of residue position of the set ofmodeled residue positions of step (A) has been selected as a pair ofpattern residue positions.
 45. A method according to claim 44, whereinthe number of considered double residue patterns is restricted inaccordance with a method taking into account a measure of the distancebetween two residue positions in the reference structure of step (A).46. A method according to claim 45, wherein the number of considereddouble residue patterns is restricted in accordance with a methodfurther taking into account a measure of the size of the amino acidresidue types placed at each of both pattern residue positions.
 47. Amethod according to claim 1, wherein the quantitative measurerepresenting the energetic compatibility of a pattern within the contextof a substituted and adapted reference structure is further transformed.48. A method according to claim 47, wherein the said transformation iseffected by means of a linear, logarithmic or exponential function. 49.A method according to claim 40, wherein the ECO's are further groupedinto GECO's and wherein this grouping operation occurs on the basis ofthe quantitative measure representing the energetic compatibility ofdifferent patterns being located at the same residue position(s) andassuming the same residue type(s).
 50. A fold recognition method toidentify a potential structural relationship between a particular targetamino acid sequence and one or more protein three-dimensionalstructures, the said protein three-dimensional structures being analyzedby a method according to claim
 1. 51. An inverse folding method toidentify a potential structural relationship between a particularprotein three-dimensional structure and one or more known amino acidsequences, wherein the said protein three-dimensional structure isanalyzed by a method according to claim
 1. 52. A protein design methodto identify or generate amino acid sequences which are energeticallycompatible with a particular protein three-dimensional structure,wherein the said protein three-dimensional structure is analyzed by amethod according to claim
 1. 53. A method according to claim 1, whereinthe cost function includes an energetic contribution accounting formain-chain flexibility.
 54. A method according to claim 1, wherein theenergy function includes an energetic contribution for solvation effectswhich is calculated by a method including the assignment of a set ofenergetic solvation terms to each residue type depending on the degreeof solvent exposure of its respective rotamers at the considered residuepositions in the protein structure.
 55. A method according to claim 1,wherein the energy function includes an energetic contribution forsolvation effects which is a type-dependent, topology-specific solvation(TTS) method including establishing a set of energetic parameters foreach of different classes of solvent exposure, and wherein the classesof solvent exposure correspond with buried, semi-buried andsolvent-exposed rotamers, respectively.
 56. A method according to claim55, wherein the table of TTS values is established by: first consideringa set of a number of well-resolved protein structures and assigning toeach residue position of this set a unique index i, then considering, ateach residue position i, all residue types a in all rotameric states r,thereby forming the set of all possible position-type-rotamercombinations i_(r) ^(a), defining by i_(r) ^(w) the WT residue type atposition i in a rotameric state r as observed in the protein comprisingposition i, assigning to each is a solvent exposure class C(i_(r) ^(a))calculating for each is the value E_(pot)(i_(r) ^(a)) as the totalpotential energy the rotamer i_(r) ^(a) experiences within the fixedcontext of the protein structure comprising i, defining a function ΔE(i)according to equation (2)ΔE(i)=min_(r)(E _(pot)(i _(r) ^(w))+E _(TTS)(w,C(i _(r)^(w))))−min_(a)min_(r)(E _(pot)(i _(r) ^(a))+E _(TTS)(a,C(i _(r) ^(a))))  ( eq.2) {E(i) being the difference in energy of the WT residue type inits best possible rotamer and the energetically most favorable residuetype at residue position i also in its best possible rotamer, given aset of TTS parameters, and applying a suitable parameter optimizationmethod to maximize S by varying E_(TTS)(T,C) parameters.
 57. Atype-dependent, topology-specific solvation method for the assignment ofa set of energetic solvation terms to a set of residue types, dependingon the degree of solvent exposure of their respective rotamers at theconsidered residue positions in a protein structure, wherein: first,each residue type at a given residue position in a specific rotamericstate is substituted into the protein structure and its accessiblesurface area (ASA) is calculated, next, a class assignment occurs on thebasis of the percentage ASA of the residue side chain in the proteinstructure compared to the maximal ASA of the same side chain beingshielded from the solvent only by its own main-chain atoms, and finallyan appropriate type- and topology-specific energy, E_(TTS)(T,C), for theconsidered pattern element is retrieved from a table of TTS values,using the pattern type (T) and class (C) indices.
 58. A method accordingto claim 57, wherein the table of TTS values is established by: firstconsidering a set of a number of well-resolved protein structures andassigning to each residue position of this set a unique index i, thenconsidering, at each residue position i, all residue types a in allrotameric states r, thereby forming the set of all possibleposition-type-rotamer combinations i_(r) ^(a), defining by i_(r) ^(w)the WT residue type at position i in a rotameric state r as observed inthe protein comprising position i, assigning to each is a solventexposure class C(i_(r) ^(a)), calculating for each i_(r) ^(a) the valueE_(pot)(i_(r) ^(a)) as the total potential energy the rotamer isexperiences within the fixed context of the protein structure comprisingi, defining a function ΔE(i) according to equation (2)ΔE(i)=min_(r)(E _(pot)(i _(r) ^(w))+E _(TTS)(w,C(i _(r)^(w))))−min_(a)min_(r)(E _(pot)(i _(r) ^(a))+E _(TTS)(a,C(i _(r) ^(a))))  (eq.2) ×E(i) being the difference in energy of the WT residue type inits best possible rotamer and the energetically most favorable residuetype at residue position i also in its best possible rotamer, given aset of TTS parameters, and applying a suitable parameter optimizationmethod to maximize S by varying E_(TTS)(T,C) parameters.
 59. A nucleicacid sequence encoding a protein sequence analyzed by a method accordingto claim
 1. 60. An expression vector comprising the nucleic acidsequence of claim
 59. 61. A host cell comprising the nucleic acidsequence of claim
 59. 62. A pharmaceutical composition comprising atherapeutically effective amount of a protein sequence analyzed by amethod according to claim 1 and a pharmaceutically acceptable carrier.63. A method of treating a disease in a mammal, comprising administeringa pharmaceutical composition according to claim 62 to said mammal inneed thereof
 64. An ECO obtainable by a method according to claim
 1. 65.A database in the form of a data structure comprising a set of ECO'sobtainable by a method according to claim
 1. 66. A computing device forcomputing a type-dependent, topology-specific solvation for theassignment of a set of energetic solvation terms to a set of residuetypes, depending on the degree of solvent exposure of their respectiverotamers at the considered residue positions in a protein structure,comprising: Means for substituting each residue type at a given residueposition in a specific rotameric state into the protein structure andits accessible surface area (ASA) is calculated, Means for a classassigning on the basis of the percentage ASA of the residue side chainin the protein structure compared to the maximal ASA of the same sidechain being shielded from the solvent only by its own main-chain atoms,and Means for retrieving an appropriate type- and topology-specificenergy, E_(TTS)(T,C), for the considered pattern element from a table ofTTS values in the form of a data structure, using the pattern type (T)and class (C) indices.
 67. A computer program product to be utilizedwith a computer system having a memory and a processor for computing atype-dependent, topology-specific solvation for the assignment of a setof energetic solvation terms to a set of residue types, depending on thedegree of solvent exposure of their respective rotamers at theconsidered residue positions in a protein structure, comprising:Instruction means for substituting each residue type at a given residueposition in a specific rotameric state into the protein structure andits accessible surface area (ASA) is calculated, Instruction means for aclass assigning on the basis of the percentage ASA of the residue sidechain in the protein structure compared to the maximal ASA of the sameside chain being shielded from the solvent only by its own main-chainatoms, and Instruction means for retrieving an appropriate type- andtopology-specific energy, E_(TTS)(T,C), for the considered patternelement from a table of TTS values in the form of a data structure,using the pattern type (T) and class (C) indices.
 68. A computerreadable data carrier comprising an executable computer program productin accordance with claim
 67. 69. A computer readable data carriercomprising an executable computer program product for executing themethod of claim
 1. 70. A method comprising the steps of inputting adescription of a least a protein reference structure at a near location;transmitting the description to a remote processing engine running acomputer program for carrying out any of the methods of claim 1;receiving at a near location from the remote processing engine an outputof a method in accordance with claim 1.